Pandas Group by certain condition-CodePudding

my dataframe looks like this:

Id	name	size	metric	week1	week2	... until week 20
54	foo	1	A	20	20	....
54	foo	1	B	0	0	....
54	foo	1	C	0.39	0.39	....
54	foo	6	A	40	40	....
54	foo	6	B	0.50	0.50	....
54	foo	6	C	0.39	0.39	....

So I have 6 rows for every id, but instead I want only 3 rows. I want to merge row 1 with row 4, row 2 with row 5 and row 3 with row 6.

So my goal is to get something like this:

Id	name	size	metric	week1	week2	... until week 20
54	foo	6	A	60	60	....
54	foo	6	B	0.50	0.50	....
54	foo	6	C	0.39	0.39	....

-> For row with "metric" A, sum every "week"-column.
-> For row with "metric" B and C, check if values are equal (if not: get the higher value)

I already tried something with:

df.groupby(["id","metric"])[df.columns[4:].sum()

But then all rows with "metric" B and C are sum as well.

And for the size:
Keep the size where the week values (of metric A) are bigger. So rows with size 1 have values of 20 and rows with size 6 have values of 40 -> Therefore keep 6 as size.

I know there is a solution with nested loops, but I am pretty sure this can be done with pandas.

CodePudding user response：

You can use below code to generate the expected output, assuming that size 6 coming bcz 6 is the max value.

data = {"Id": [54, 54, 54, 54, 54, 54],"name": ["foo", "foo","foo","foo","foo","foo"],"size":[1,1,1,6,6,6], 
    "metrics":["A", "B", "C", "A", "B", "C"], "week1": [20, 0, 0.39, 40, 0.50, 0.39],
    "week2": [20, 0, 0.39, 40, 0.50, 0.39]
   }


 df.groupby(["metrics"]).agg({"Id":max, "name":max, "size":max, "week1":sum, "week2":sum})

OutPut

metrics	Id	name	size	week1	week2
A	54	foo	6	60.00	60.00
B	54	foo	6	0.50	0.50
C	54	foo	6	0.78	0.78

CodePudding user response：

week = ['week{}'.format(i) for i in range(1,23)]
df = pd.read_excel('a.xlsx')
df2 = pd.concat([df.query("metric == 'A'") 
                .groupby(['metric','id','name'],as_index=False)[week].agg('sum'),
           df.query("metric != 'A'")
                .groupby(['metric','id','name'],as_index=False)[week].agg('max')
          ]).reset_index(drop=True)
df3 = pd.concat([df.groupby(['metric','id','name'],as_index=False)['size'].agg('max')]).reset_index(drop=True)
df4 = pd.merge(df2, df3, how='inner', on=['metric'], suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')
df = df4.reindex(columns=df.columns)
df

Create 22-week columns using list comprehension
Read file into pandas data frame
Group df by metric, id, and name summing all the week columns for metric='A'
Group df by metric, id, and name finding the max values of the week columns for metric='B' and 'C'
Group df by metric, id, and name finding the max size
Merge two dfs without keeping the duplicates
Reindex the columns of the final df using the reference of the main df

Output:

    id  name    size    metric  week1   week2   
0   54  foo       6      A       60.0    60.0   
1   54  foo       6      B       0.50    0.50    
2   54  foo       6      C       0.39    0.39