Home > database >  Parsing nest JSON to Pandas Dataframe
Parsing nest JSON to Pandas Dataframe

Time:01-29

I'm trying to convert a JSON file structured like this:

oneD = {'Record':
            {'first': ['A', 'B', 
                    {'inter': ['1', '2', '3']}],
                     'second': ['C', 'D', 'E']},
        'Record2':
            {'first': ['G', 'H',
                    {'inter': ['5', '6']}],
                     'second': ['I', 'J', 'K']
            }}

And I would ultimately like to end up with a pandas dataframe like this:

import pandas as pd

data = {'Title': ['Record', 'Record', 'Record', 'Record2', 'Record2', 'Record2'],
    'First': ['A',      'B',      'NA',     'G',       'H',    'NA'],
    'Inter': ['1',      '2',       '3',     '5',       '6',   'NA'],
    'Second':['C',      'D',       'E',     'I',       'J',   'K']
    }

 df=pd.DataFrame(data)
    Title First Inter Second
0   Record     A     1      C
1   Record     B     2      D
2   Record    NA     3      E
3  Record2     G     5      I
4  Record2     H     6      J
5  Record2    NA    NA      K

I tried the following as suggested here (pandas dataframe from nested JSON with lists):

df = pd.DataFrame.from_dict(oneD, orient="index")
df2 = pd.concat([pd.DataFrame(df[i].values.tolist(), 
                          columns=[f"{i}_{num}" for num in range(len(df[i].iat[0]))]
                          ) for i in df.columns],axis=1)

Unfortunately, this produces:

 first_0 first_1 first_2                   second_0 second_1 second_2 second_3
 0       A       B   {'inter': ['1', '2', '3']}   C        D        E        F
 1       G       H   {'inter': ['5', '6']}        I        J        K     None

Can anyone offer any suggestions? I'm losing my mind in a series of nested loops trying to convert the full JSON file.

CodePudding user response:

Use df.T, df.apply, pd.Series.explode and pd.concat:

# Create a df from oneD and transpose it.
In [205]: x = pd.DataFrame(oneD).T.apply(pd.Series.explode).rename_axis('Title').reset_index()

# Convert rows with dict in 'first' to a new column 'inter'
In [210]: x[['first', 'inter']] = x['first'].apply(pd.Series)

# Group on 'Title' and explode the 'inter' column and create a new df
In [212]: inter = x.groupby('Title')['inter'].first().explode().reset_index()

# Concat the two df's 'x' and 'inter' to get the desired result
In [216]: res = pd.concat([x[['Title', 'first', 'second']], inter['inter']], 1).fillna('NA')

In [217]: res
Out[217]: 
     Title first second inter
0   Record     A      C     1
1   Record     B      D     2
2   Record    NA      E     3
3  Record2     G      I     5
4  Record2     H      J     6
5  Record2    NA      K    NA

CodePudding user response:

You could "reshape" the data before passing it to pandas.

 import itertools

 oneD = {...}

 rows = []
 for name, item in oneD.items():
     item['inter'] = item['first'].pop()['inter']
     row = [[name, *row] for row in
                 itertools.zip_longest(*item.values())]
     rows.extend(row)
>>> rows
[['Record', 'A', 'C', '1'],
 ['Record', 'B', 'D', '2'],
 ['Record', None, 'E', '3'],
 ['Record2', 'G', 'I', '5'],
 ['Record2', 'H', 'J', '6'],
 ['Record2', None, 'K', None]]

>>> pd.DataFrame(rows, columns=[name]   list(item))
   Record2 first second inter
0   Record     A      C     1
1   Record     B      D     2
2   Record  None      E     3
3  Record2     G      I     5
4  Record2     H      J     6
5  Record2  None      K  None
  •  Tags:  
  • Related