I have a pandas dataframe like this:
import numpy as np
import pandas as pd
data = np.random.randint(0, 10, (4, 8))
multiindex = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3, 4]])
df = pd.DataFrame(data, columns=multiindex)
df
lvl1 A B
lvl2 a b c d a b c d
0 3 5 3 6 3 6 4 9
1 9 6 1 9 7 1 5 1
2 6 5 5 6 0 9 7 7
3 3 0 1 4 7 1 4 8
For each unique entry in "lvl1" I want to build sums like x = a b and y = c d. The result should look like this:
A B
x y x y
0 7 16 9 1
1 8 6 1 8
2 8 12 10 9
3 7 7 11 7
I can reach this by iterating and concat like below, but there is probably a more elegant way with something like groupby, apply or agg.
list_df = []
sum_x = ["a", "b"]
sum_y = ["c", "d"]
for level in df.columns.get_level_values(level=0).unique():
for sum_xy, title in zip([sum_x, sum_y], ["x", "y"]):
df_sum = df.loc[:, (level, sum_xy)].sum(axis=1).rename((level, title))
list_df.append(df_sum)
pd.concat(list_df, axis=1)
When I do a groupby with df.groupby("lvl1", axis=1).apply... I have the problem that the "lvl1" header remains and I don't know how to continue. Also I tried something with df.stack(level=0) to get rid of the multiindex without succes. I tried the approaches in these questions but could't find a working solution. Does anyone have an idea? Do I need a groupby at all or not? Thank you!
How to sum specific hierarchical columns in pandas?
Aggregating lambda functions in pandas and numpy
CodePudding user response:
One way is to build a dictionary, use it to map the second level, then you can groupby:
lookup = {c:'x' for c in sum_x}
lookup.update({c:'y' for c in sum_y})
df.groupby([df.columns.get_level_values(0),
df.columns.get_level_values(1).map(lookup)],
axis=1).sum()
Option: also
df.rename(columns=lookup, level=1).groupby(level=(0,1), axis=1).sum()
You would get something like this:
A B
x y x y
0 8 5 14 14
1 11 7 9 1
2 16 3 7 5
3 10 2 1 12
CodePudding user response:
Lest try:
Preparation:
import numpy as np
import pandas as pd
data = np.array([[9, 7, 9, 6, 1, 3, 9, 6],
[3, 7, 6, 7, 0, 9, 1, 2],
[3, 3, 0, 8, 6, 3, 1, 5],
[2, 4, 0, 6, 2, 7, 6, 9]])
multiindex = pd.MultiIndex.from_product([["A", "B"], ['a', 'b', 'c', 'd']])
df = pd.DataFrame(data, columns=multiindex)
sum_x = ["a", "b"]
sum_y = ["c", "d"]
Code:
- insert 'by hand' a level in the columns multiindex
df.columns = pd.MultiIndex.from_tuples([(tup[0], 'x' if tup[1] in sum_x else 'y' ,tup[1])
for tup in df.columns])
Now df:
A B
x y x y
a b c d a b c d
0 9 7 9 6 1 3 9 6
1 3 7 6 7 0 9 1 2
2 3 3 0 8 6 3 1 5
3 2 4 0 6 2 7 6 9
- standard sum on columns multindex last level
df.stack().groupby(level=0).sum()
Result:
A B
x y x y
0 16.0 15.0 4.0 15.0
1 10.0 13.0 9.0 3.0
2 6.0 8.0 9.0 6.0
3 6.0 6.0 9.0 15.0
