I have a pandas DataFrame of the form
| quarter | user_id | # Sessions |
|---|---|---|
| 2022 Q1 | 1 | 9 |
| 2021 Q4 | 1 | |
| 2021 Q3 | 1 | |
| 2022 Q1 | 2 | 8 |
| 2021 Q4 | 2 | |
| 2021 Q3 | 2 |
And I'd like to forward fill the # Sessions column within each user_id to get a table like:
| quarter | user_id | # Sessions |
|---|---|---|
| 2022 Q1 | 1 | 9 |
| 2021 Q4 | 1 | 9 |
| 2021 Q3 | 1 | 9 |
| 2022 Q1 | 2 | 8 |
| 2021 Q4 | 2 | 8 |
| 2021 Q3 | 2 | 8 |
I can do this with
x.groupby('user_id').apply(lambda x: x.fillna({'# Sessions': x['# Sessions'].ffill()}))
But I have reasonably big data (~10k users), and this is a very common operation in the codebase.
Is .groupby necessary or is there a more performant way to achieve the same?
CodePudding user response:
You could also try transforming first (since first skips NaNs):
df['# Sessions'] = df.groupby('user_id')['# Sessions'].transform('first')
Output:
quarter user_id # Sessions
0 2022 Q1 1 9.0
1 2021 Q4 1 9.0
2 2021 Q3 1 9.0
3 2022 Q1 2 8.0
4 2021 Q4 2 8.0
5 2021 Q3 2 8.0
CodePudding user response:
We can avoid use groupby, First we use DataFrame.sort_values,
Then we just use ffill except for the rows where user_id change occurs.
df2 = df.sort_values('user_id')
df['# Sessions'] = df2['# Sessions'].ffill()\
.where(df2['user_id'].eq(df2['user_id'].shift()),
df2['# Sessions'])
CodePudding user response:
This should be much faster than what you're doing:
x['# Sessions'] = x.groupby('user_id')['# Sessions'].ffill()
