I have 2 data frames sample, with names changed
df1 =
| Comp_code | DepartmentListA | DepartmentListB |
| -------- | --------------- | --------------- |
| Code_1 | "Dept1" | "Dept3" |
| Code_2 | "Dept2" | "Dept4" |
| Code_3 | "Dept4, Dept5" | "Dept1" |
| Code_4 | "Dept1,Dept5, Dept6" | "Dept3, Dept4" |
df2 = with just department and revenue
| DepartmentList | Revenue| Gross Margin|
| --------------- | -------| -------|
| "Dept1" | 1000 | 500
| "Dept2" | 2000 | 0
| "Dept3, | 5000 | 900
| "Dept4" | 5000 | 200
| "Dept5" | 7000 | -100
| "Dept6" | 8000 | 2500
I would like my final df to have company codes and total revenue and gross margin. Summing up totals of departments in column A and B. I am not able to iterate and join because of the comma separated string of departments. My final DF should be something like this
expected df =
| Comp_code | GrossRev| Tot Margin|
| -------- | --------------- | --------------- |
| Code_1 | 6000 | 1400 |
| Code_2 | 7000 | 200 |
| Code_3 | 13000 | 600 |
| Code_4 | 26000 | 4000 |
Also the data frames are both a couple of million rows and some department lists (comma separated values) are about 100 in number. Would be good if there was a performant way of doing this.
TIA,
CodePudding user response:
You can do that with the following
df["all"] = ((df["DepartmentListA"]
", "
df["DepartmentListB"]).str.split("[\\s,] "))
df3 = df.explode("all")
df3 = (pd.merge(df3, df2, left_on='all',
right_on='Comp_code', suffixes=('', '_y'))
.groupby("Comp_code")
.agg({'Revenue': ['sum'], 'Gross Margin': 'sum'}).reset_index())
df3.columns = df3.columns.droplevel(-1)
print(df3)
Comp_code Revenue Gross Margin
0 Code_1 6000 1400
1 Code_2 7000 200
2 Code_3 13000 600
3 Code_4 26000 4000
CodePudding user response:
You can melt explode to get individual row per department, then merge with df2 to get the data, and finally groupby sum to aggregate the sum:
(df1.melt(id_vars='Comp_code')
.assign(value=lambda d: d['value'].str.split(r',\s*'))
.explode('value')
.merge(df2, left_on='value', right_on='DepartmentList')
.groupby('Comp_code').agg({'Revenue': 'sum', 'Gross Margin': 'sum'})
)
output:
Revenue Gross Margin
Comp_code
Code_1 6000 1400
Code_2 7000 200
Code_3 13000 600
Code_4 26000 4000
