pandas.Grouper for time intervals behavior-CodePudding

Having a DF of ids and timestamps, like:

  id               timestamp  idx
0  1 2021-10-24 17:56:03.641    0
1  1 2021-10-24 17:56:04.086    1
2  1 2021-10-24 17:56:11.217    2

I'm trying to group time ranges in each id by 5 minutes and set the first idx of each group to the entire range, like so:

  id               timestamp  idx  first_index
0  1 2021-10-24 17:56:03.641    0            0
1  1 2021-10-24 17:56:04.086    1            0
2  1 2021-10-24 17:56:11.217    2            0
3  1 2021-10-24 19:33:50.359    3            3 <-- taking new idx out 5 min range 
4  1 2021-10-24 19:33:55.278    4            3

I've written the following code:

def time_groups(df):
    df = df.copy()
    # grouping per 5Min and id
    g = df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])
    # get first values per groups to new column
    df['first_index'] = g['idx'].transform('first')
    return df

Code works fine for a single id, but when I have some ids it produce different results for the same id. I have written the following Colab to reproduce the issue.

With all data ends before 5 min:

     id               timestamp  idx  first_index
171   6 2021-10-24 18:03:39.323  171          171
172   6 2021-10-24 18:03:53.551  172          171
173   6 2021-10-24 18:04:21.932  173          171
174   6 2021-10-24 18:04:22.221  174          171
175   6 2021-10-24 18:04:22.484  175          171
176   6 2021-10-24 18:04:36.399  176          171
177   6 2021-10-24 18:04:36.649  177          171
------------------------------------------------- Group end
178   6 2021-10-24 18:06:20.437  178          178
179   6 2021-10-24 18:06:20.695  179          178
180   6 2021-10-24 18:06:21.001  180          178

With only one id does not end before 5 min:

     id               timestamp  idx  first_index
171   6 2021-10-24 18:03:39.323  171          171
172   6 2021-10-24 18:03:53.551  172          171
173   6 2021-10-24 18:04:21.932  173          171
174   6 2021-10-24 18:04:22.221  174          171
175   6 2021-10-24 18:04:22.484  175          171
176   6 2021-10-24 18:04:36.399  176          171
177   6 2021-10-24 18:04:36.649  177          171
178   6 2021-10-24 18:06:20.437  178          171
179   6 2021-10-24 18:06:20.695  179          171
180   6 2021-10-24 18:06:21.001  180          171

What am I missing?

Update: If the first ID is removed:

df = df[3:]

code works fine

CodePudding user response：

I think this is because group[er origin is looking at first timestamp in the entire series, and not per grouped id.

This seems to work:

def tgs(df):
  df_list = [g for _,g in df.groupby('id')]
  res_list = []
  for df_s in df_list:
    g = df_s.groupby([pd.Grouper(key="timestamp", freq='5min', origin='start')])
    df_s['first_index'] = g['idx'].transform('first')
    res_list.append(df_s)
  return pd.concat(res_list)

CodePudding user response：

When applying:

df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])

The origin of the timestamp grouping is the first timestamp in the entire dataframe, not per group.

According to the doc ‘start’: origin is the first value of the timeseries https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html

looking at df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')]).size() you can see that all groups are in 5 minute intervals (or multiplications of 5 minute intervals) even groups that are per different id :

id  timestamp              
1   2021-10-24 17:56:03.641     3
2   2021-10-24 19:31:03.641    10
    2021-10-24 19:36:03.641     9
...

6   2021-10-24 18:01:03.641     7
    2021-10-24 18:06:03.641    13
    ...

If you look at id 6, its first group is actually at an earlier timestamp than it's first event. This is caused for the same reason - the "buckets" for all users are based on 5 minute intervals from the first timestamp of the entire dataset. all rows that are before 18:06:03.641 are grouped in a 18:01:03.641 "bucket" and all rows that are after are grouped to the 18:06:03.641 "bucket".

The first row of the dataset is the earliest, so when you remove the first user the bug is no longer visible.

I think you can get the functionality you're looking for by first grouping by id and then applying an additional group-with-grouper using apply:

def split_to_five_minute_groups(x):
  return (x.groupby([pd.Grouper(key="timestamp", freq='5min', origin='start')]))[['idx']].transform('first')

df['first_idx'] = df.groupby(['id']).apply(split_to_five_minute_groups)