Home > Blockchain >  Pandas json normalize why it returns NaN for repeated values
Pandas json normalize why it returns NaN for repeated values

Time:01-19

I have a dataframe where 2 columns are lists of dictionaries that I want to expand to separate columns. For example:

 id    text                   agg_inds                                           agg_tars
 1     some text    [{"f1": [15], "f2": "2263"}, {"f1": [16], "f2": "2171"}]    [{"f1": [5, 6, 12], "f2": "2984"}]

I want to create 2 columns for the nested column agg_inds named ind_pos and ind_id and 2 different columns for agg_tars named tar_pos and tar_id.

The problem with using json_normalize is that it returns NaN values when a value repeat itsself. For example , in the above row, I would like to have this:

desired output

id  text        ind_pos     ind_id         tar_pos        tar_ind
1               [15]        2263           [5, 6, 12]     2984    
1   some text   [16]        2171           [5, 6, 12]     2984    

But here is the current output:

id  tex  ind_pos      ind_id        tar_pos        tar_ind
1         [15]        2263          [5, 6, 12]     2984    
1   NaN   [16]        2171          NaN            NaN    

Here's the code :

s = (df.set_index('id')
          .apply(lambda x: x.apply(pd.Series).stack())
          .reset_index()
          .drop('level_1', 1))

s_ind = pd.json_normalize(s['agg_inds'])
columns_renaming = {"f1": "ind_pos", "f2": "ind_id"}
s_ind.rename(columns=columns_renaming, inplace=True)

s_tar= pd.json_normalize(s['agg_targets'])
columns_renaming = {"f1": "tar_pos", "f2": "tar_id"}
s_tar.rename(columns=columns_renaming, inplace=True)

s = s.drop(columns=['agg_inds', 'agg_targets'])
df_1 = s.join(s_ind)
df_final = df_1.join(s_tar)
print(df_final)

CodePudding user response:

You can use ffill to replace the NaN values with the value of the previous row. For this you only have to add one line:

s = (df.set_index('id')
          .apply(lambda x: x.apply(pd.Series).stack())
          .reset_index()
          .drop('level_1', 1))

s.ffill(inplace=True)

s_ind = pd.json_normalize(s['agg_inds'])
columns_renaming = {"f1": "ind_pos", "f2": "ind_id"}
s_ind.rename(columns=columns_renaming, inplace=True)

s_tar= pd.json_normalize(s['agg_targets'])
columns_renaming = {"f1": "tar_pos", "f2": "tar_id"}
s_tar.rename(columns=columns_renaming, inplace=True)

s = s.drop(columns=['agg_inds', 'agg_targets'])
df_1 = s.join(s_ind)
df_final = df_1.join(s_tar)
print(df_final)

Output:

   id        text ind_pos ind_id     tar_pos tar_id
0   1   some text    [15]   2263  [5, 6, 12]   2984
1   1   some text    [16]   2171  [5, 6, 12]   2984

CodePudding user response:

If I have an initial dataset like this:

# import libraries
import pandas as pd
import json

# read data
df = pd.DataFrame({
    'id': [1],
    'text': ['some text'],
    'agg_inds': ['[{"f1": [15], "f2": "2263"}, {"f1": [16], "f2": "2171"}]'],
    'agg_tars': ['[{"f1": [5, 6, 12], "f2": "2984"}]'],
})

# convert columns
for col in ['agg_inds', 'agg_tars']:
    df[col] = df[col].apply(lambda x: json.loads(x))

# set id column as index
df = df.set_index('id')

then I can reuse your processing logic in order to create a function that extracts the both types of features, and creates a data frame out of it:

def extract_features(col: str, feat_new_names: dict):
    return (
        df[col]
        .apply(pd.Series)
        .stack()
        .apply(pd.Series)
        .reset_index()
        .drop(['level_1'], axis=1)
        .set_index('id')
        .rename(columns=feat_new_names)
    )

df_agg_inds = extract_features(col='agg_inds', feat_new_names={'f1': 'ind_pos', 'f2': 'ind_id'})
df_agg_tars = extract_features(col='agg_tars', feat_new_names={'f1': 'tar_pos', 'f2': 'tar_id'})

Because all of the data frames have got the id column set as an index, I can use outer join to put them all together without loosing any information:

df_final = (
    pd
    .concat([df, df_agg_inds, df_agg_tars], axis=1)
    .drop(['agg_inds', 'agg_tars'], axis=1)
)

And the result looks like this:

>>> print(df_final)
         text ind_pos ind_id     tar_pos tar_id
id                                             
1   some text    [15]   2263  [5, 6, 12]   2984
1   some text    [16]   2171  [5, 6, 12]   2984

I hope this will help <3

  •  Tags:  
  • Related