I have a dataframe where 2 columns are lists of dictionaries that I want to expand to separate columns. For example:
id text agg_inds agg_tars
1 some text [{"f1": [15], "f2": "2263"}, {"f1": [16], "f2": "2171"}] [{"f1": [5, 6, 12], "f2": "2984"}]
I want to create 2 columns for the nested column agg_inds named ind_pos and ind_id and 2 different columns for agg_tars named tar_pos and tar_id.
The problem with using json_normalize is that it returns NaN values when a value repeat itsself. For example , in the above row, I would like to have this:
desired output
id text ind_pos ind_id tar_pos tar_ind
1 [15] 2263 [5, 6, 12] 2984
1 some text [16] 2171 [5, 6, 12] 2984
But here is the current output:
id tex ind_pos ind_id tar_pos tar_ind
1 [15] 2263 [5, 6, 12] 2984
1 NaN [16] 2171 NaN NaN
Here's the code :
s = (df.set_index('id')
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.drop('level_1', 1))
s_ind = pd.json_normalize(s['agg_inds'])
columns_renaming = {"f1": "ind_pos", "f2": "ind_id"}
s_ind.rename(columns=columns_renaming, inplace=True)
s_tar= pd.json_normalize(s['agg_targets'])
columns_renaming = {"f1": "tar_pos", "f2": "tar_id"}
s_tar.rename(columns=columns_renaming, inplace=True)
s = s.drop(columns=['agg_inds', 'agg_targets'])
df_1 = s.join(s_ind)
df_final = df_1.join(s_tar)
print(df_final)
CodePudding user response:
You can use ffill to replace the NaN values with the value of the previous row. For this you only have to add one line:
s = (df.set_index('id')
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.drop('level_1', 1))
s.ffill(inplace=True)
s_ind = pd.json_normalize(s['agg_inds'])
columns_renaming = {"f1": "ind_pos", "f2": "ind_id"}
s_ind.rename(columns=columns_renaming, inplace=True)
s_tar= pd.json_normalize(s['agg_targets'])
columns_renaming = {"f1": "tar_pos", "f2": "tar_id"}
s_tar.rename(columns=columns_renaming, inplace=True)
s = s.drop(columns=['agg_inds', 'agg_targets'])
df_1 = s.join(s_ind)
df_final = df_1.join(s_tar)
print(df_final)
Output:
id text ind_pos ind_id tar_pos tar_id
0 1 some text [15] 2263 [5, 6, 12] 2984
1 1 some text [16] 2171 [5, 6, 12] 2984
CodePudding user response:
If I have an initial dataset like this:
# import libraries
import pandas as pd
import json
# read data
df = pd.DataFrame({
'id': [1],
'text': ['some text'],
'agg_inds': ['[{"f1": [15], "f2": "2263"}, {"f1": [16], "f2": "2171"}]'],
'agg_tars': ['[{"f1": [5, 6, 12], "f2": "2984"}]'],
})
# convert columns
for col in ['agg_inds', 'agg_tars']:
df[col] = df[col].apply(lambda x: json.loads(x))
# set id column as index
df = df.set_index('id')
then I can reuse your processing logic in order to create a function that extracts the both types of features, and creates a data frame out of it:
def extract_features(col: str, feat_new_names: dict):
return (
df[col]
.apply(pd.Series)
.stack()
.apply(pd.Series)
.reset_index()
.drop(['level_1'], axis=1)
.set_index('id')
.rename(columns=feat_new_names)
)
df_agg_inds = extract_features(col='agg_inds', feat_new_names={'f1': 'ind_pos', 'f2': 'ind_id'})
df_agg_tars = extract_features(col='agg_tars', feat_new_names={'f1': 'tar_pos', 'f2': 'tar_id'})
Because all of the data frames have got the id column set as an index, I can use outer join to put them all together without loosing any information:
df_final = (
pd
.concat([df, df_agg_inds, df_agg_tars], axis=1)
.drop(['agg_inds', 'agg_tars'], axis=1)
)
And the result looks like this:
>>> print(df_final)
text ind_pos ind_id tar_pos tar_id
id
1 some text [15] 2263 [5, 6, 12] 2984
1 some text [16] 2171 [5, 6, 12] 2984
I hope this will help <3
