my dataframe looks like this:
| Id | name | size | metric | week1 | week2 | ... until week 20 |
|---|---|---|---|---|---|---|
| 54 | foo | 1 | A | 20 | 20 | .... |
| 54 | foo | 1 | B | 0 | 0 | .... |
| 54 | foo | 1 | C | 0.39 | 0.39 | .... |
| 54 | foo | 6 | A | 40 | 40 | .... |
| 54 | foo | 6 | B | 0.50 | 0.50 | .... |
| 54 | foo | 6 | C | 0.39 | 0.39 | .... |
So I have 6 rows for every id, but instead I want only 3 rows. I want to merge row 1 with row 4, row 2 with row 5 and row 3 with row 6.
So my goal is to get something like this:
| Id | name | size | metric | week1 | week2 | ... until week 20 |
|---|---|---|---|---|---|---|
| 54 | foo | 6 | A | 60 | 60 | .... |
| 54 | foo | 6 | B | 0.50 | 0.50 | .... |
| 54 | foo | 6 | C | 0.39 | 0.39 | .... |
-> For row with "metric" A, sum every "week"-column.
-> For row with "metric" B and C, check if values are equal (if not: get the higher value)
I already tried something with:
df.groupby(["id","metric"])[df.columns[4:].sum()
But then all rows with "metric" B and C are sum as well.
And for the size:
Keep the size where the week values (of metric A) are bigger.
So rows with size 1 have values of 20 and rows with size 6 have values of 40 -> Therefore keep 6 as size.
I know there is a solution with nested loops, but I am pretty sure this can be done with pandas.
CodePudding user response:
You can use below code to generate the expected output, assuming that size 6 coming bcz 6 is the max value.
data = {"Id": [54, 54, 54, 54, 54, 54],"name": ["foo", "foo","foo","foo","foo","foo"],"size":[1,1,1,6,6,6],
"metrics":["A", "B", "C", "A", "B", "C"], "week1": [20, 0, 0.39, 40, 0.50, 0.39],
"week2": [20, 0, 0.39, 40, 0.50, 0.39]
}
df.groupby(["metrics"]).agg({"Id":max, "name":max, "size":max, "week1":sum, "week2":sum})
OutPut
| metrics | Id | name | size | week1 | week2 |
|---|---|---|---|---|---|
| A | 54 | foo | 6 | 60.00 | 60.00 |
| B | 54 | foo | 6 | 0.50 | 0.50 |
| C | 54 | foo | 6 | 0.78 | 0.78 |
CodePudding user response:
week = ['week{}'.format(i) for i in range(1,23)]
df = pd.read_excel('a.xlsx')
df2 = pd.concat([df.query("metric == 'A'")
.groupby(['metric','id','name'],as_index=False)[week].agg('sum'),
df.query("metric != 'A'")
.groupby(['metric','id','name'],as_index=False)[week].agg('max')
]).reset_index(drop=True)
df3 = pd.concat([df.groupby(['metric','id','name'],as_index=False)['size'].agg('max')]).reset_index(drop=True)
df4 = pd.merge(df2, df3, how='inner', on=['metric'], suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')
df = df4.reindex(columns=df.columns)
df
- Create 22-week columns using list comprehension
- Read file into pandas data frame
- Group df by metric, id, and name summing all the week columns for metric='A'
- Group df by metric, id, and name finding the max values of the week columns for metric='B' and 'C'
- Group df by metric, id, and name finding the max size
- Merge two dfs without keeping the duplicates
- Reindex the columns of the final df using the reference of the main df
Output:
id name size metric week1 week2
0 54 foo 6 A 60.0 60.0
1 54 foo 6 B 0.50 0.50
2 54 foo 6 C 0.39 0.39
