I have a dataset
Name System
A AZ
A NaN
B AZ
B NaN
B NaN
C AY
C AY
D AZ
E AY
E AY
E NaN
F AZ
F AZ
F NaN
Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".
In the above example, Names A, B and D have one "AZ" "Subset" while C, E have two "AY" subsets and F has two AZ so it is a different cluster. We can ignore NaN.
Output Example:
Cluster Names
AZ A,B,Z
AY,AY C,E
AZ,AZ F
How can I do it using Python?
PS. Actual dataset may vary in number of rows and columns Also, how can I do it using ML based classification algorithms like KNN, Naive Bayes, etc?
CodePudding user response:
Use groupby agg twice; once to join "Systems" and then to join "Names":
s = df.dropna().groupby('Name').agg(', '.join)['System']
s = pd.Series(s.index, index=s)
out = s.groupby(level=0).agg(', '.join).reset_index().rename(columns={'System':'Cluster'})
Output:
Cluster Name
0 AY, AY C, E
1 AZ A, B, D
2 AZ, AZ F
CodePudding user response:
If ordering per groups is same use double groupby by Name and then by System columns:
df1 = (df.dropna(subset=['System'])
.groupby('Name')['System']
.agg(','.join)
.reset_index()
.groupby('System')['Name']
.agg(','.join)
.rename_axis('Cluster')
.reset_index())
print (df1)
Cluster Name
0 AY,AY C,E
1 AZ A,B,D
2 AZ,AZ F
If ordering should be different, so sort values is necessary use:
df1 = (df.dropna(subset=['System'])
.sort_values(['Name','System'])
.groupby('Name')['System'].agg(','.join)
.reset_index()
.groupby('System')['Name']
.agg(','.join)
.rename_axis('Cluster')
.reset_index())
