I have a DataFrame of ids and timestamps, like:
id timestamp idx
0 1 2021-10-24 17:56:03.641 0
1 1 2021-10-24 17:56:04.086 1
2 1 2021-10-24 17:56:11.217 2
For each id, I'm trying to group rows into 5-minute time ranges and assign the first idx of each group to every row in that range, like so:
id timestamp idx first_index
0 1 2021-10-24 17:56:03.641 0 0
1 1 2021-10-24 17:56:04.086 1 0
2 1 2021-10-24 17:56:11.217 2 0
3 1 2021-10-24 19:33:50.359 3 3 <-- new first_index, because this row is outside the previous 5-minute range
4 1 2021-10-24 19:33:55.278 4 3
I've written the following code:
def time_groups(df):
    df = df.copy()
    # grouping per 5Min and id
    g = df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])
    # get first values per groups to new column
    df['first_index'] = g['idx'].transform('first')
    return df
The code works fine for a single id, but when there are several ids it produces different results for the same id. I have written the following Colab to reproduce the issue.
With all the data, the group ends before 5 minutes have passed:
id timestamp idx first_index
171 6 2021-10-24 18:03:39.323 171 171
172 6 2021-10-24 18:03:53.551 172 171
173 6 2021-10-24 18:04:21.932 173 171
174 6 2021-10-24 18:04:22.221 174 171
175 6 2021-10-24 18:04:22.484 175 171
176 6 2021-10-24 18:04:36.399 176 171
177 6 2021-10-24 18:04:36.649 177 171
------------------------------------------------- Group end
178 6 2021-10-24 18:06:20.437 178 178
179 6 2021-10-24 18:06:20.695 179 178
180 6 2021-10-24 18:06:21.001 180 178
With only one id, the group does not end before 5 minutes have passed:
id timestamp idx first_index
171 6 2021-10-24 18:03:39.323 171 171
172 6 2021-10-24 18:03:53.551 172 171
173 6 2021-10-24 18:04:21.932 173 171
174 6 2021-10-24 18:04:22.221 174 171
175 6 2021-10-24 18:04:22.484 175 171
176 6 2021-10-24 18:04:36.399 176 171
177 6 2021-10-24 18:04:36.649 177 171
178 6 2021-10-24 18:06:20.437 178 171
179 6 2021-10-24 18:06:20.695 179 171
180 6 2021-10-24 18:06:21.001 180 171
What am I missing?
Update: if the first id is removed:
    df = df[3:]
the code works fine.
CodePudding user response:
I think this is because the Grouper origin is taken from the first timestamp in the entire series, not from the first timestamp of each grouped id.
This seems to work:
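def tgs(df):
    # one sub-frame per id, so origin='start' is evaluated separately for each id
    df_list = [g for _, g in df.groupby('id')]
    res_list = []
    for df_s in df_list:
        df_s = df_s.copy()  # work on a copy to avoid SettingWithCopyWarning
        g = df_s.groupby([pd.Grouper(key="timestamp", freq='5min', origin='start')])
        df_s['first_index'] = g['idx'].transform('first')
        res_list.append(df_s)
    return pd.concat(res_list)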
CodePudding user response:
When applying:
    df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])
the origin of the timestamp grouping is the first timestamp in the entire DataFrame, not the first timestamp of each id.
According to the docs, 'start' means the origin is the first value of the timeseries: https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html
Looking at df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')]).size(), you can see that all groups start at 5-minute intervals (or multiples of 5 minutes) from that single origin, even groups that belong to different ids:
id  timestamp
1   2021-10-24 17:56:03.641     3
2   2021-10-24 19:31:03.641    10
    2021-10-24 19:36:03.641     9
...
6   2021-10-24 18:01:03.641     7
    2021-10-24 18:06:03.641    13
...
If you look at id 6, its first group starts at an earlier timestamp than its first event. The cause is the same: the "buckets" for all users are 5-minute intervals counted from the first timestamp of the entire dataset. For id 6, all rows before 18:06:03.641 fall into the 18:01:03.641 "bucket", and all rows after it fall into the 18:06:03.641 "bucket".
The first row of the dataset is the earliest, so when you remove the first user the bug is no longer visible.
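To make the shared origin easier to see, here is a minimal sketch on made-up data (the three rows below are hypothetical, not taken from the Colab). Id 2 has two events only 3 minutes apart, but they land on opposite sides of a bucket edge that is counted from id 1's first timestamp, so they should end up in different groups:

```python
import pandas as pd

# Hypothetical toy data: id 2 starts 3 minutes after id 1.
df = pd.DataFrame({
    "id": [1, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-10-24 10:00:00",
        "2021-10-24 10:03:00",
        "2021-10-24 10:06:00",
    ]),
    "idx": [0, 1, 2],
})

# Same grouping as in the question: a single Grouper shared by all ids,
# so the 5-minute bucket edges are 10:00, 10:05, ... for every id.
g = df.groupby(["id", pd.Grouper(key="timestamp", freq="5min", origin="start")])
df["first_index"] = g["idx"].transform("first")

print(df)
```

The two rows of id 2 get different first_index values here, although a bucket anchored at id 2's own first event (10:03) would contain both of them.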
I think you can get the functionality you're looking for by first grouping by id and then applying a second grouping with the Grouper inside apply:
def split_to_five_minute_groups(x):
    # origin='start' now refers to the first timestamp of this id only
    return x.groupby([pd.Grouper(key="timestamp", freq='5min', origin='start')])[['idx']].transform('first')
# group_keys=False keeps the original row index, so the result aligns with df
df['first_idx'] = df.groupby(['id'], group_keys=False).apply(split_to_five_minute_groups)['idx']
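If you would rather avoid apply altogether, another option is to compute the bucket number by hand. This is only a sketch of an alternative, not what pd.Grouper does internally: it swaps the Grouper for integer division of the time elapsed since each id's first timestamp. The function name first_index_per_id and the sample rows are made up for illustration, and it assumes the rows of each id are already sorted by timestamp, since 'first' takes the first row as it appears in the frame.

```python
import pandas as pd

def first_index_per_id(df, freq="5min"):
    """Assign to each row the first idx of its per-id bucket of width `freq`."""
    out = df.copy()
    # First timestamp of each id, broadcast back to every row of that id.
    start = out.groupby("id")["timestamp"].transform("min")
    # Integer bucket number: 0 for the first `freq` after start, 1 for the next, ...
    bin_no = ((out["timestamp"] - start) // pd.Timedelta(freq)).rename("bin")
    # First idx within each (id, bucket) pair, aligned to the original rows.
    out["first_index"] = out.groupby(["id", bin_no])["idx"].transform("first")
    return out

# Hypothetical sample: id 2 starts about 2 minutes after id 1.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-10-24 17:56:03.641",
        "2021-10-24 17:56:04.086",
        "2021-10-24 19:33:50.359",
        "2021-10-24 17:58:00.000",
        "2021-10-24 18:02:00.000",
    ]),
    "idx": [0, 1, 2, 3, 4],
})

result = first_index_per_id(df)
print(result)
```

Because the buckets are measured from each id's own minimum, removing or adding other ids does not move the bucket edges for the remaining ones.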
