I need some help concatenating some strings from a DF.
I have the below "original" df:
df = pd.DataFrame({'Material': [1,1,1,2,2,2,3,3,3,3],
'BOM': ['A','B',np.nan,'A',np.nan,'C','A','A','B','C']})
| Material | BOM | |
|---|---|---|
| 0 | 1 | A |
| 1 | 1 | B |
| 2 | 1 | NaN |
| 3 | 2 | A |
| 4 | 2 | NaN |
| 5 | 2 | C |
| 6 | 3 | A |
| 7 | 3 | A |
| 8 | 3 | B |
| 9 | 3 | C |
Expected Result is:
| Material | BOM |
|---|---|
| 1 | A, B |
| 2 | A, C |
| 3 | A, B, C |
I'm new with Python and am thinking of doing it with a mix of Loop and Dict but definitely think there must be a better way.
The tricky bit is that I think I need to group by material and check column "BOM" for the distinct options (different than NaN) then some sort of ", ".join them.
I've tried the following:
df.groupby('Material')['BOM'].apply(', '.join)
And got the error: "TypeError: sequence item 2: expected str instance, float found"
If anyone could help me I would be really grateful
CodePudding user response:
You can dropna and drop_duplicates, then GroupBy.agg:
df.dropna().drop_duplicates().groupby('Material').agg(', '.join)
or:
df.dropna().groupby('Material').agg(lambda s: ', '.join(s.drop_duplicates()))
output:
BOM
Material
1 A, B
2 A, C
3 A, B, C
CodePudding user response:
3 solutions depending on output requirements:
Using .agg:
df.dropna().drop_duplicates().groupby('Material').agg(', '.join)
Output:
BOM
Material
1 A, B
2 A, C
3 A, B, C
Using .apply(list):
df.dropna().drop_duplicates().groupby('Material')['BOM'].apply(list).reset_index()
Output:
Material BOM
0 1 [A, B]
1 2 [A, C]
2 3 [A, B, C]
Using .apply(set): df.dropna().groupby('Material')['BOM'].apply(set).reset_index()
Output:
Material BOM
0 1 {A, B}
1 2 {A, C}
2 3 {A, C, B}
