Clustering in DataFrame using Python-CodePudding

I have a dataset

Name    System
A       AZ
A       NaN
B       AZ
B       NaN
B       NaN
C       AY
C       AY
D       AZ
E       AY
E       AY
E       NaN
F       AZ
F       AZ
F       NaN

Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".

In the above example, Names A, B and D have one "AZ" "Subset" while C, E have two "AY" subsets and F has two AZ so it is a different cluster. We can ignore NaN.

Output Example:

Cluster     Names
AZ          A,B,Z
AY,AY       C,E
AZ,AZ       F

How can I do it using Python?

PS. Actual dataset may vary in number of rows and columns Also, how can I do it using ML based classification algorithms like KNN, Naive Bayes, etc?

CodePudding user response：

Use groupby agg twice; once to join "Systems" and then to join "Names":

s = df.dropna().groupby('Name').agg(', '.join)['System']
s = pd.Series(s.index, index=s)
out = s.groupby(level=0).agg(', '.join).reset_index().rename(columns={'System':'Cluster'})

Output:

  Cluster     Name
0  AY, AY     C, E
1      AZ  A, B, D
2  AZ, AZ        F

CodePudding user response：

If ordering per groups is same use double groupby by Name and then by System columns:

df1 = (df.dropna(subset=['System'])
         .groupby('Name')['System']
         .agg(','.join)
         .reset_index()
         .groupby('System')['Name']
         .agg(','.join)
         .rename_axis('Cluster')
         .reset_index())

print (df1)
  Cluster   Name
0   AY,AY    C,E
1      AZ  A,B,D
2   AZ,AZ      F

If ordering should be different, so sort values is necessary use:

df1 = (df.dropna(subset=['System'])
         .sort_values(['Name','System'])
         .groupby('Name')['System'].agg(','.join)
         .reset_index()
         .groupby('System')['Name']
         .agg(','.join)
         .rename_axis('Cluster')
         .reset_index())