i have data frame like this,
| id | value |
|---|---|
| a | 2 |
| a | 4 |
| a | 3 |
| a | 5 |
| b | 1 |
| b | 4 |
| b | 3 |
| c | 1 |
| c | nan |
| c | 5 |
the resulted data frame contain new column ['average'] and to get its values will be:
- make group-by(id)
- first row in 'average' column per each group is equal to its corresponding value in 'value'
- other rows in ' average' in group is equal to mean for all previous rows in 'value'(except current value)
- for more explain in group of (a) the "average' of index
the resulted data frame must be :
| id | value | average |
|---|---|---|
| a | 2 | 2 |
| a | 4 | 2 |
| a | 3 | 3 |
| a | 5 | 3 |
| b | 1 | 1 |
| b | 4 | 1 |
| b | 3 | 2.5 |
| c | 1 | 1 |
| c | nan | 1 |
| c | 5 | 1 |
CodePudding user response:
You can group the dataframe by id, then calculate the expanding mean for value column for each groups, then shift the expanding mean and get it back to the original dataframe, once you have it, you just need to ffill on axis=1 on for the value and average columns to get the first value for the categories:
out = (df
.assign(average=df
.groupby(['id'])['value']
.transform(lambda x: x.expanding().mean().shift(1))
)
)
out[['value', 'average']] = out[['value', 'average']].ffill(axis=1)
OUTPUT:
id value average
0 a 2.0 2.0
1 a 4.0 2.0
2 a 3.0 3.0
3 a 5.0 3.0
4 b 1.0 1.0
5 b 4.0 1.0
6 b 3.0 2.5
7 c 1.0 1.0
8 c NaN 1.0
9 c 5.0 1.0
CodePudding user response:
Here is a solution which, I think, satisfies the requirements. Here, the first row in a group of ids is simply passing its value to the average column. For every other row, we take the average where the index is smaller than the current index.
You may want to specify how you want to handle the NaN values. In the below, I set them to None so that they are ignored.
import numpy as np
from numpy import average
import pandas as pd
df = pd.DataFrame([
['a', 2],
['a', 4],
['a', 3],
['a', 5],
['b', 1],
['b', 4],
['b', 3],
['c', 1],
['c', np.NAN],
['c', 5]
], columns=['id', 'value'])
# Replace the NaN value with None
df['value'] = df['value'].replace(np.nan, None)
id_groups = df.groupby(['id'])
id_level_frames = []
for group, frame in id_groups:
print(group)
# Resets the index for each id-level frame
frame = frame.reset_index()
for index, row in frame.iterrows():
# If this is the first row:
if index== 0:
frame.at[index, 'average'] = row['value']
else:
current_index = index
earlier_rows = frame[frame.index < index]
frame.at[index, 'average'] = average(earlier_rows['value'])
id_level_frames.append(frame)
final_df = pd.concat(id_level_frames)
