Python pandas. Need to aggregate if column value is in another comma separated value column of secon-CodePudding

I have 2 data frames sample, with names changed

df1 = 
| Comp_code | DepartmentListA | DepartmentListB |
| --------  | --------------- | --------------- |
| Code_1    | "Dept1"         | "Dept3"         |
| Code_2    | "Dept2"         | "Dept4"         |
| Code_3    | "Dept4, Dept5"  | "Dept1"         |
| Code_4    | "Dept1,Dept5, Dept6"  | "Dept3, Dept4"  |

df2 = with just department and revenue

| DepartmentList | Revenue|  Gross Margin|
| --------------- | -------| -------|
| "Dept1"         | 1000         | 500
| "Dept2"         | 2000         | 0
| "Dept3,         | 5000         | 900
| "Dept4"         | 5000         | 200
| "Dept5"         | 7000         | -100
| "Dept6"         | 8000         | 2500

I would like my final df to have company codes and total revenue and gross margin. Summing up totals of departments in column A and B. I am not able to iterate and join because of the comma separated string of departments. My final DF should be something like this

expected df =

| Comp_code | GrossRev| Tot Margin|
| --------  | --------------- | --------------- |
| Code_1    | 6000         | 1400         |
| Code_2    | 7000         | 200         |
| Code_3    | 13000  | 600         |
| Code_4    | 26000  | 4000  |

Also the data frames are both a couple of million rows and some department lists (comma separated values) are about 100 in number. Would be good if there was a performant way of doing this.

TIA,

CodePudding user response：

You can do that with the following

df["all"] = ((df["DepartmentListA"]   
              ", "   
              df["DepartmentListB"]).str.split("[\\s,] "))

df3 = df.explode("all")

df3 = (pd.merge(df3, df2, left_on='all', 
               right_on='Comp_code', suffixes=('', '_y'))
               .groupby("Comp_code")
                .agg({'Revenue': ['sum'], 'Gross Margin': 'sum'}).reset_index())

df3.columns = df3.columns.droplevel(-1)
print(df3)

  Comp_code  Revenue  Gross Margin
0    Code_1     6000          1400
1    Code_2     7000           200
2    Code_3    13000           600
3    Code_4    26000          4000

CodePudding user response：

You can melt explode to get individual row per department, then merge with df2 to get the data, and finally groupby sum to aggregate the sum:

(df1.melt(id_vars='Comp_code')
    .assign(value=lambda d: d['value'].str.split(r',\s*'))
    .explode('value')
    .merge(df2, left_on='value', right_on='DepartmentList')
    .groupby('Comp_code').agg({'Revenue': 'sum', 'Gross Margin': 'sum'})
)

output:

           Revenue  Gross Margin
Comp_code                       
Code_1        6000          1400
Code_2        7000           200
Code_3       13000           600
Code_4       26000          4000