I want convert the nested dictionary list into data frame. but Nested dictionary each key's List has different length. for instance, the below nested dictionary keys x's,y's,z's list has different lengths.
dict_example = {
'a': {
'x': ['f1', 'f2', 'f6'],
'y': ['f6', 'f9', 'f2', 'f8'],
'z': ['f1', 'f9', 'f2', 'f8', 'f6', 'f10', 'f3']
},
'b': {
'x': ['f1', 'f2', 'f6', 'f4'],
'y': ['f6', 'f9', 'f2', 'f8', 'f17','f12'],
'z': ['f1', 'f9', 'f2', 'f10', 'f3']
}
}
from tabulate import tabulate
import pandas as pd
df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in dict_example.items()]))
print(tabulate(df, headers='keys', tablefmt='psql'))
the above code is giving the output like the below table. But i want to get the each value of the list into a row that belongs to nested key. key a and b are main columns.
---- --------------------------------------------- ---------------------------------
| | a | b |
|---- --------------------------------------------- ---------------------------------|
| x | ['f1', 'f2', 'f6'] | ['f1', 'f2', 'f6', 'f4'] |
| y | ['f6', 'f9', 'f2', 'f8'] | ['f6', 'f9', 'f2', 'f8', 'f17'] |
| z | ['f1', 'f9', 'f2', 'f8', 'f6', 'f10', 'f3'] | ['f1', 'f9', 'f2', 'f10', 'f3'] |
---- --------------------------------------------- ---------------------------------
The output data frame should look like this.
a b
x f1 f1
f2 f2
f6 f6
NaN f4
y f6 f6
f9 f9
f2 f2
f8 f8
NaN f17
z f1 f1
f9 f9
f2 f2
f8 f10
f6 f3
f10 NaN
f3 NaN
CodePudding user response:
Use dict comprehension with concat and DataFrame.stack:
d = {k: pd.DataFrame({k1:pd.Series(v1) for k1, v1 in v.items()})
for k,v in dict_example.items()}
df = pd.concat(d, axis=1).stack()
print (df)
a b
x 0 f1 f1
1 f2 f2
2 f6 f6
3 NaN f4
y 0 f6 f6
1 f9 f9
2 f2 f2
3 f8 f8
4 NaN f17
5 NaN f12
z 0 f1 f1
1 f9 f9
2 f2 f2
3 f8 f10
4 f6 f3
5 f10 NaN
6 f3 NaN
CodePudding user response:
This approach seems pretty fast using itertools and numpy:
import pandas as pd
import numpy as np
import itertools
def process(a, b):
x = np.array(list(itertools.zip_longest(*[a, b],fillvalue=np.nan))).T
return x[0], x[1]
df = pd.DataFrame(dict_example)
df[['a','b']] = df.apply(lambda x: process(x.a, x.b), axis=1, result_type='expand')
df = df.explode(['a', 'b'])
a b
x f1 f1
x f2 f2
x f6 f6
x nan f4
y f6 f6
y f9 f9
y f2 f2
y f8 f8
y nan f17
y nan f12
z f1 f1
z f9 f9
z f2 f2
z f8 f10
z f6 f3
z f10 nan
z f3 nan
Or even faster with to_numpy():
def process(A, B):
a_col, b_col = [], []
for a,b in zip(A,B):
x = np.array(list(itertools.zip_longest(*[a, b],fillvalue=np.nan))).T
a_col.append(x[0])
b_col.append(x[1])
return a_col, b_col
df = pd.DataFrame(dict_example)
df['a'], df['b'] = process(df['a'].to_numpy(), df['b'].to_numpy())
df = df.explode(['a', 'b'])
