I have 2 data frames sample, with names changed:
df1 =
| Comp_code | DepartmentListA | DepartmentListB |
|---|---|---|
| Code_1 | "Dept1" | "Dept3" |
| Code_2 | "Dept2" | "Dept4" |
| Code_3 | "Dept4, Dept5" | "Dept1" |
| Code_4 | "Dept1,Dept5, Dept6" | "Dept3, Dept4" |
df2 = with just department and revenue
| DepartmentList | Revenue | Gross Margin |
|---|---|---|
| "Dept1" | 1000 | 500 |
| "Dept2" | 2000 | 0 |
| "Dept3, | 5000 | 900 |
| "Dept4" | 5000 | 200 |
| "Dept5" | 7000 | -100 |
| "Dept6" | 8000 | 2500 |
I would like my final df to have company codes and total revenue and gross margin. Summing up totals of departments in column A and B. I am not able to iterate and join because of the comma separated string of departments. My final DF should be something like this
expected df =
| Comp_code | GrossRev | Tot Margin |
|---|---|---|
| Code_1 | 6000 | 1400 |
| Code_2 | 7000 | 200 |
| Code_3 | 13000 | 600 |
| Code_4 | 26000 | 4000 |
Also the data frames are both a couple of million rows and some department lists (comma separated values) are about 100 in number. Would be good if there was a performant way of doing this.
CodePudding user response:
This code works. It's very long, but it's mostly just repetitive.
new_df = df1[['Comp_code']].copy()
new_df['GrossRev'] = df1['DepartmentListB'].str.split(',').explode().map(df2.set_index('DepartmentList')['Revenue']).groupby(level=0).sum() df1['DepartmentListA'].str.split(',').explode().map(df2.set_index('DepartmentList')['Revenue']).groupby(level=0).sum()
new_df['Tot Margin'] = df1['DepartmentListB'].str.split(',').explode().map(df2.set_index('DepartmentList')['GrossMargin']).groupby(level=0).sum() df1['DepartmentListA'].str.split(',').explode().map(df2.set_index('DepartmentList')['GrossMargin']).groupby(level=0).sum()
Output:
>>> new_df
Comp_code GrossRev Tot Margin
0 Code_1 6000 1400
1 Code_2 7000 200
2 Code_3 13000 600
3 Code_4 26000 4000
