I would like to create multiple columns which show the row-wise cumulative mean for grouped columns. Here is some sample data:
import pandas as pd
data = [[1, 4, 6, 10, 15, 40, 90, 100], [2, 5, 3, 11, 25, 50, 90, 120], [3, 7, 9, 14, 35, 55, 100, 120]]
df = pd.DataFrame(data, columns=['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4'])
a1 a2 a3 a4 b1 b2 b3 b4
0 1 4 6 10 15 40 90 100
1 2 5 3 11 25 50 90 120
2 3 7 9 14 35 55 100 120
What I want is to generate new columns like this:
- New column
a1_2is calculated by the mean of columnsa1anda2row-wise. - New column
a1_3is calculated by the mean of columnsa1,a2anda3row-wise. - New column
a1_4is calculated by the mean of columnsa1,a2,a3anda4row-wise.
The same should happen for the grouped columns with b. Of course you can do this manually, but this is not ideal when you have too many variables. Here is the expected output:
df['a1_2'] = df[['a1', 'a2']].mean(axis=1)
df['a1_3'] = df[['a1', 'a2', 'a3']].mean(axis=1)
df['a1_4'] = df[['a1', 'a2', 'a3', 'a4']].mean(axis=1)
df['b1_2'] = df[['b1', 'b2']].mean(axis=1)
df['b1_3'] = df[['b1', 'b2', 'b3']].mean(axis=1)
df['b1_4'] = df[['b1', 'b2', 'b3', 'b4']].mean(axis=1)
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
So I was wondering if there is some automatic way of doing this?
CodePudding user response:
IIUC, you can reshape with wide_to_long, perform a groupby.rolling, then unstack and flatten the multiindex:
df.join(pd
.wide_to_long(df.reset_index(), stubnames=['a', 'b'], i='index', j='n')
.groupby(level='index').expanding().mean()
.loc[lambda d: d.index.get_level_values('n')>1]
.droplevel(0).unstack('n')
.pipe(lambda d: d.set_axis(d.columns.map(lambda x: f'{x[0]}1_{x[1]}'), axis=1))
)
output:
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
CodePudding user response:
groups = df.groupby(lambda col: col[0], axis=1)
cummeans = (groups.cumsum(axis=1).div(groups.cumcount().add(1))
.filter(regex="[^1]$")
.rename(lambda col: re.sub(r"(\d )$", r"1_\1", col), axis=1))
result = df.join(cummeans)
get the groups determined by 1st character of columns (or some other way depending on pattern)
get the cumulative means = cumsum / cumcount 1
filter out the very first cummeans, e.g., to-be a1_1 etc.
insert "1_" into the cummean column names
join with the original df
In [19]: groups = df.groupby(lambda col: col[0], axis=1)
In [20]: cummeans = groups.cumsum(axis=1).div(groups.cumcount().add(1))
In [21]: cummeans
Out[21]:
a1 a2 a3 a4 b1 b2 b3 b4
0 1.0 2.5 3.666667 5.25 15.0 27.5 48.333333 61.25
1 2.0 3.5 3.333333 5.25 25.0 37.5 55.000000 71.25
2 3.0 5.0 6.333333 8.25 35.0 45.0 63.333333 77.50
In [22]: _.filter(regex="[^1]$")
Out[22]:
a2 a3 a4 b2 b3 b4
0 2.5 3.666667 5.25 27.5 48.333333 61.25
1 3.5 3.333333 5.25 37.5 55.000000 71.25
2 5.0 6.333333 8.25 45.0 63.333333 77.50
In [23]: _.rename(lambda col: re.sub(r"(\d )$", r"1_\1", col), axis=1)
Out[23]:
a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 2.5 3.666667 5.25 27.5 48.333333 61.25
1 3.5 3.333333 5.25 37.5 55.000000 71.25
2 5.0 6.333333 8.25 45.0 63.333333 77.50
In [24]: df.join(_)
Out[24]:
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
as a "one" line:
df.join(df.groupby(lambda col: col[0], axis=1)
.pipe(lambda gr: gr.cumsum(axis=1).div(gr.cumcount().add(1))
.filter(regex="[^1]$")
.rename(lambda col: re.sub(r"(\d )$", r"1_\1", col), axis=1)))
CodePudding user response:
expanding.mean
for c in ('a', 'b'):
m = df.filter(like=c).expanding(axis=1).mean().iloc[:, 1:]
df[m.columns.str.replace(r'(\d )$', r'1_\1', regex=True)] = m
Result
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
Another option:
out = [value.expanding(axis=1).mean()
.rename(columns = lambda col: f"{col[0]}1_{col[1]}")
for _, value in df.groupby(df.columns.str[0], axis = 1)]
pd.concat([df] out, axis = 1)
a1 a2 a3 a4 b1 b2 b3 b4 a1_1 a1_2 a1_3 a1_4 b1_1 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 1.0 2.5 3.666667 5.25 15.0 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 2.0 3.5 3.333333 5.25 25.0 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 3.0 5.0 6.333333 8.25 35.0 45.0 63.333333 77.50
