I'm trying to convert a JSON file structured like this:
oneD = {'Record':
{'first': ['A', 'B',
{'inter': ['1', '2', '3']}],
'second': ['C', 'D', 'E']},
'Record2':
{'first': ['G', 'H',
{'inter': ['5', '6']}],
'second': ['I', 'J', 'K']
}}
And I would ultimately like to end up with a pandas dataframe like this:
import pandas as pd
data = {'Title': ['Record', 'Record', 'Record', 'Record2', 'Record2', 'Record2'],
'First': ['A', 'B', 'NA', 'G', 'H', 'NA'],
'Inter': ['1', '2', '3', '5', '6', 'NA'],
'Second':['C', 'D', 'E', 'I', 'J', 'K']
}
df=pd.DataFrame(data)
Title First Inter Second
0 Record A 1 C
1 Record B 2 D
2 Record NA 3 E
3 Record2 G 5 I
4 Record2 H 6 J
5 Record2 NA NA K
I tried the following as suggested here (pandas dataframe from nested JSON with lists):
df = pd.DataFrame.from_dict(oneD, orient="index")
df2 = pd.concat([pd.DataFrame(df[i].values.tolist(),
columns=[f"{i}_{num}" for num in range(len(df[i].iat[0]))]
) for i in df.columns],axis=1)
Unfortunately, this produces:
first_0 first_1 first_2 second_0 second_1 second_2 second_3
0 A B {'inter': ['1', '2', '3']} C D E F
1 G H {'inter': ['5', '6']} I J K None
Can anyone offer any suggestions? I'm losing my mind in a series of nested loops trying to convert the full JSON file.
CodePudding user response:
Use df.T, df.apply, pd.Series.explode and pd.concat:
# Create a df from oneD and transpose it.
In [205]: x = pd.DataFrame(oneD).T.apply(pd.Series.explode).rename_axis('Title').reset_index()
# Convert rows with dict in 'first' to a new column 'inter'
In [210]: x[['first', 'inter']] = x['first'].apply(pd.Series)
# Group on 'Title' and explode the 'inter' column and create a new df
In [212]: inter = x.groupby('Title')['inter'].first().explode().reset_index()
# Concat the two df's 'x' and 'inter' to get the desired result
In [216]: res = pd.concat([x[['Title', 'first', 'second']], inter['inter']], 1).fillna('NA')
In [217]: res
Out[217]:
Title first second inter
0 Record A C 1
1 Record B D 2
2 Record NA E 3
3 Record2 G I 5
4 Record2 H J 6
5 Record2 NA K NA
CodePudding user response:
You could "reshape" the data before passing it to pandas.
import itertools
oneD = {...}
rows = []
for name, item in oneD.items():
item['inter'] = item['first'].pop()['inter']
row = [[name, *row] for row in
itertools.zip_longest(*item.values())]
rows.extend(row)
>>> rows
[['Record', 'A', 'C', '1'],
['Record', 'B', 'D', '2'],
['Record', None, 'E', '3'],
['Record2', 'G', 'I', '5'],
['Record2', 'H', 'J', '6'],
['Record2', None, 'K', None]]
>>> pd.DataFrame(rows, columns=[name] list(item))
Record2 first second inter
0 Record A C 1
1 Record B D 2
2 Record None E 3
3 Record2 G I 5
4 Record2 H J 6
5 Record2 None K None
