Pandas Group by certain condition

Time:01-12

My DataFrame looks like this:

Id name size metric week1 week2 ... until week 20
54 foo 1 A 20 20 ....
54 foo 1 B 0 0 ....
54 foo 1 C 0.39 0.39 ....
54 foo 6 A 40 40 ....
54 foo 6 B 0.50 0.50 ....
54 foo 6 C 0.39 0.39 ....

So I have 6 rows for every id, but instead I want only 3 rows. I want to merge row 1 with row 4, row 2 with row 5 and row 3 with row 6.

So my goal is to get something like this:

Id name size metric week1 week2 ... until week 20
54 foo 6 A 60 60 ....
54 foo 6 B 0.50 0.50 ....
54 foo 6 C 0.39 0.39 ....

-> For rows with "metric" A, sum every "week" column.
-> For rows with "metric" B and C, keep the value if both rows agree; otherwise take the higher one.

I already tried something with:

df.groupby(["id","metric"])[df.columns[4:]].sum()

But then all rows with "metric" B and C are summed as well.


And for the size:
Keep the size where the week values (of metric A) are bigger. So rows with size 1 have values of 20 and rows with size 6 have values of 40 -> Therefore keep 6 as size.

I know there is a solution with nested loops, but I am pretty sure this can be done with pandas.

CodePudding user response:

The code below gets close to the expected output, assuming size 6 is kept because it is the maximum value. Note that it sums the week columns for every metric, so C comes out as 0.78 instead of 0.39.

import pandas as pd

data = {"Id": [54, 54, 54, 54, 54, 54], "name": ["foo", "foo", "foo", "foo", "foo", "foo"], "size": [1, 1, 1, 6, 6, 6],
    "metrics": ["A", "B", "C", "A", "B", "C"], "week1": [20, 0, 0.39, 40, 0.50, 0.39],
    "week2": [20, 0, 0.39, 40, 0.50, 0.39]
   }
df = pd.DataFrame(data)

# Group by metric: take the max of Id, name, and size, and sum the week columns.
df.groupby(["metrics"]).agg({"Id": "max", "name": "max", "size": "max", "week1": "sum", "week2": "sum"})

Output:

metrics Id name size week1 week2
A 54 foo 6 60.00 60.00
B 54 foo 6 0.50 0.50
C 54 foo 6 0.78 0.78
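
If B and C should not be summed, one option is to aggregate the "A" rows and the other rows separately and then stack the results. The following is only a sketch on the same sample data; the helper names (week_cols, keys, summed, maxed, result) are made up for illustration, and size is still assumed to be the plain maximum.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [54] * 6,
    "name": ["foo"] * 6,
    "size": [1, 1, 1, 6, 6, 6],
    "metrics": ["A", "B", "C", "A", "B", "C"],
    "week1": [20, 0, 0.39, 40, 0.50, 0.39],
    "week2": [20, 0, 0.39, 40, 0.50, 0.39],
})

# Pick up every week column by its name prefix.
week_cols = [c for c in df.columns if c.startswith("week")]
keys = ["Id", "name", "metrics"]

# Metric A is summed; every other metric keeps the higher value.
is_a = df["metrics"] == "A"
summed = df[is_a].groupby(keys, as_index=False)[week_cols].sum()
maxed = df[~is_a].groupby(keys, as_index=False)[week_cols].max()

result = pd.concat([summed, maxed], ignore_index=True)

# Assumption: size is simply the largest size seen for each Id.
result["size"] = result["Id"].map(df.groupby("Id")["size"].max())
print(result)
```

With this, A is summed to 60 per week while B and C stay at 0.50 and 0.39.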

CodePudding user response:

import pandas as pd

# Week column names; the question has week1 through week20.
week = ['week{}'.format(i) for i in range(1, 21)]
df = pd.read_excel('a.xlsx')
keys = ['metric', 'id', 'name']
df2 = pd.concat([df.query("metric == 'A'")
                .groupby(keys, as_index=False)[week].agg('sum'),
           df.query("metric != 'A'")
                .groupby(keys, as_index=False)[week].agg('max')
          ]).reset_index(drop=True)
df3 = df.groupby(keys, as_index=False)['size'].agg('max')
# Merge on all grouping keys so rows from different ids are never paired up.
df4 = pd.merge(df2, df3, how='inner', on=keys)
df = df4.reindex(columns=df.columns)
df

  1. Build the list of 20 week column names using a list comprehension
  2. Read the file into a pandas DataFrame
  3. Group df by metric, id, and name, summing all the week columns for metric 'A'
  4. Group df by metric, id, and name, taking the max of the week columns for metrics 'B' and 'C'
  5. Group df by metric, id, and name, taking the max size
  6. Merge the two DataFrames on metric, id, and name
  7. Reindex the columns of the final DataFrame to match the column order of the original df

Output:

    id  name    size    metric  week1   week2   
0   54  foo       6      A       60.0    60.0   
1   54  foo       6      B       0.50    0.50    
2   54  foo       6      C       0.39    0.39    
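
Both answers take the maximum size, which happens to match here. The question actually asks for the size whose metric A week values are bigger. A rough sketch of that rule follows, assuming "bigger" means the larger total across all week columns; week_total, best_idx, and size_by_id are hypothetical names, and the sample data is typed in rather than read from a file.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [54] * 6,
    "name": ["foo"] * 6,
    "size": [1, 1, 1, 6, 6, 6],
    "metric": ["A", "B", "C", "A", "B", "C"],
    "week1": [20, 0, 0.39, 40, 0.50, 0.39],
    "week2": [20, 0, 0.39, 40, 0.50, 0.39],
})
week_cols = [c for c in df.columns if c.startswith("week")]

# Only the metric A rows decide which size is kept.
a_rows = df[df["metric"] == "A"].copy()

# Assumption: compare sizes by the total of their week values.
a_rows["week_total"] = a_rows[week_cols].sum(axis=1)

# For each id, find the row label with the largest total, then read off its size.
best_idx = a_rows.groupby("id")["week_total"].idxmax()
size_by_id = a_rows.loc[best_idx].set_index("id")["size"]
print(size_by_id)
```

The resulting Series is indexed by id, so it could be mapped onto the aggregated frame's id column to overwrite the size there.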