convert dataframe to nested dictionary (one nesting level per column)-CodePudding

I have this dataframe:

COL0    COL1    COL2    COL3
----------------------------
   A      A1     A11    A111
   A      A1     A11    A112
   A      A1     A12    A113
   A      A1     A12    A114
   A      A2     A13    A115
   A      A2     A13    A116
   A      A2     A14    A117
   A      A2     A14    A118

And I would like to obtain a dictionary like the below from it. If I just apply the to_dict() method to the original dataframe, the format is not what I would like.

{
    'A':{
        'A1':{
            'A11':['A111', 'A112'],
            'A12':['A113', 'A114']
        },
        'A2':{
            'A13':['A115', 'A116'],
            'A13':['A117', 'A118']            
        }
    }
}

PS: snippet to generate the above dataframe:

df = pd.DataFrame(
{
    'COL0': ['A']*8,
    'COL1': ['A1']*4   ['A2']*4,
    'COL2': ['A11']*2   ['A12']*2   ['A12']*2   ['A13']*2,
    'COL3': [f'A11{i 1}' for i in range(8)]
})

EDIT:

TypeError Traceback (most recent call last) in 1 {a: {k: f.groupby('COL2')['COL3'].apply(list).to_dict() 2 for k, f in g.groupby('COL1')} ----> 3 for a, g in df.groupby('COL0')}

in (.0) 1 {a: {k: f.groupby('COL2')['COL3'].apply(list).to_dict() 2 for k, f in g.groupby('COL1')} ----> 3 for a, g in df.groupby('COL0')}

in (.0) 1 {a: {k: f.groupby('COL2')['COL3'].apply(list).to_dict() ----> 2 for k, f in g.groupby('COL1')} 3 for a, g in df.groupby('COL0')}

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in apply(self, func, *args, **kwargs) 219 ) 220 def apply(self, func, *args, **kwargs): --> 221 return super().apply(func, *args, **kwargs) 222 223 @doc(_agg_template, examples=_agg_examples_doc, klass="Series")

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs) 865 def apply(self, func, *args, **kwargs): 866 --> 867 func = self._is_builtin_func(func) 868 869 # this is needed so we don't try and wrap strings. If we could

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in _is_builtin_func(self, arg) 342 otherwise return the arg 343 """ --> 344 return self._builtin_table.get(arg, arg) 345 346

TypeError: unhashable type: 'list'

EDIT2:

TypeError Traceback (most recent call last) in 1 out = {} ----> 2 for keys, v in df.groupby(list(df.columns[:-1]))[df.columns[-1]]: 3 d = out # restart at root 4 val = v.to_list() 5 for k in keys:

TypeError: 'list' object is not callable

CodePudding user response：

Iterate over the rows and collect key-value pairs; use dict.setdefault to set a default value if a key doesn't exist yet:

out = {}
for w,x,y,z in df.to_numpy():
    out.setdefault(w, {}).setdefault(x, {}).setdefault(y, []).append(z)

With pd.groupby and a nested dict comprehension:

out = {a: {k: f.groupby('COL2')['COL3'].apply(list).to_dict()
           for k, f in g.groupby('COL1')} 
       for a, g in df.groupby('COL0')}

Output:

{'A': {'A1': {'A11': ['A111', 'A112'], 'A12': ['A113', 'A114']},
  'A2': {'A13': ['A115', 'A116'], 'A14': ['A117', 'A118']}}}

CodePudding user response：

Here is a way that works with an arbitrary number of columns.

It uses first groupby to set the groups per all but the last column.

Then it goes down the tree, changing the reference to the parent until it reaches the last key.

out = {}
for keys, v in df.groupby(list(df.columns[:-1]))[df.columns[-1]]:
    d = out             # restart at root
    val = v.to_list()
    for k in keys:
        if k not in d:
            d[k] = {}   # create child if missing
        parent = d
        d = d[k]        # go down in nested level
    parent[k] = val     # we reached the bottom, set the value

output:

{'A': {'A1': {'A11': ['A111', 'A112'],
              'A12': ['A113', 'A114']},
       'A2': {'A13': ['A115', 'A116'],
              'A14': ['A117', 'A118']},
       },
}

Example on further nested dictionary:

from  itertools import product
df = (pd.DataFrame(product('AB', '12', '123', '1', '12'))
        .add_prefix('COL').cumsum(1)
      )
#    COL0 COL1 COL2  COL3   COL4
#  0    A   A1  A11  A111  A1111
#  1    A   A1  A11  A111  A1112
#  2    A   A1  A12  A121  A1211
#  3    A   A1  A12  A121  A1212
#  4    A   A1  A13  A131  A1311
# ...
# 23    B   B2  B23  B231  B2312

output:

{'A': {'A1': {'A11': {'A111': ['A1111', 'A1112']},
              'A12': {'A121': ['A1211', 'A1212']},
              'A13': {'A131': ['A1311', 'A1312']}},
       'A2': {'A21': {'A211': ['A2111', 'A2112']},
              'A22': {'A221': ['A2211', 'A2212']},
              'A23': {'A231': ['A2311', 'A2312']}}},
 'B': {'B1': {'B11': {'B111': ['B1111', 'B1112']},
              'B12': {'B121': ['B1211', 'B1212']},
              'B13': {'B131': ['B1311', 'B1312']}},
       'B2': {'B21': {'B211': ['B2111', 'B2112']},
              'B22': {'B221': ['B2211', 'B2212']},
              'B23': {'B231': ['B2311', 'B2312']}}}}