Simple clustering in a pandas dataframe

Time:01-15

I have a dataframe with the following data:

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'col2': ['action1', 'action2', 'action1', 'action3', 'action1', 'action2']})

which looks like

col1, col2
A   , action1
A   , action2
B   , action1
B   , action3
C   , action1
C   , action2

Since A and C both have action1 and action2, they should be grouped together; B forms a separate group. So I want to generate the data frame below:

col1, col2
A, C, action1, action2
B   , action1, action3

How can I achieve this?

CodePudding user response:

If the values in col2 appear in the same order within each group, you can aggregate with join per col1 and then group again by the joined column:

df = df.groupby('col1')['col2'].agg(', '.join).reset_index()
df = df.groupby('col2')['col1'].agg(', '.join).reset_index()[['col1','col2']]
print (df)
   col1              col2
0  A, C  action1, action2
1     B  action1, action3
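
The ordering condition matters because the joined strings are compared as plain text. As a minimal sketch (same data as the question, but with the two rows for C swapped; the variable names `step1` and `result` are only illustrative), the same two groupby calls no longer put A and C together:

```python
import pandas as pd

# Same data as in the question, except the two rows for C are swapped.
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'col2': ['action1', 'action2', 'action1', 'action3', 'action2', 'action1']})

# Step 1: one joined string of actions per col1 value.
step1 = df.groupby('col1')['col2'].agg(', '.join).reset_index()

# Step 2: group col1 values that share exactly the same joined string.
result = step1.groupby('col2')['col1'].agg(', '.join).reset_index()[['col1', 'col2']]

# 'action1, action2' and 'action2, action1' are different strings,
# so A and C end up in separate rows.
print(result)
```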

Or, if the order of the values may differ between groups, use frozenset:

df = df.groupby('col1')['col2'].agg(frozenset).reset_index()
df = (df.groupby('col2')['col1']
        .agg(', '.join)
        .rename(lambda x: ', '.join(x)).reset_index()[['col1','col2']])
print (df)
   col1              col2
0  A, C  action2, action1
1     B  action1, action3


For example, with the rows for C in a different order:

print (df)
  col1     col2
0    A  action1
1    A  action2
2    B  action1
3    B  action3
4    C  action2 <-changed order
5    C  action1 <-changed order

df = df.groupby('col1')['col2'].agg(frozenset).reset_index()
df = (df.groupby('col2')['col1']
        .agg(', '.join)
        .rename(lambda x: ', '.join(x)).reset_index()[['col1','col2']])
print (df)
   col1              col2
0  A, C  action2, action1
1     B  action1, action3
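
Since a frozenset has no guaranteed iteration order, the joined col2 text can come out as "action2, action1". One possible alternative, which is not part of the answer above, is to sort each group's actions before joining them, so the key is order-independent and the printed text is predictable. A minimal sketch, where `per_key` and `clusters` are illustrative names and `set` is used to drop duplicate actions the way frozenset does:

```python
import pandas as pd

# Question data with the rows for C in a different order.
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'col2': ['action1', 'action2', 'action1', 'action3', 'action2', 'action1']})

# De-duplicate and sort each group's actions, then join them into one string.
per_key = (df.groupby('col1')['col2']
             .agg(lambda s: ', '.join(sorted(set(s))))
             .reset_index())

# Group col1 values that share the same sorted action string.
clusters = (per_key.groupby('col2')['col1']
                   .agg(', '.join)
                   .reset_index()[['col1', 'col2']])
print(clusters)
```

With this variant A and C share the key "action1, action2" regardless of row order.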

CodePudding user response:

Use groupby twice: once to join col2 by col1, and then again to join col1 by the joined col2.

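# Join the actions of each col1 value into one string (index: col1)
srs = df.groupby('col1')['col2'].apply(','.join)
# Swap index and values so the joined actions become the index
srs = pd.Series(srs.index, index=srs)
# Join the col1 values that share the same joined actions
out = srs.groupby(level=0).apply(','.join)
# Swap back and turn the result into a two-column frame
out = pd.Series(out.index, index=out).to_frame().reset_index()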

Output:

  col1             col2
0  A,C  action1,action2
1    B  action1,action3