Group data by timedelta in python-CodePudding

I have been collecting fuel consumption data from flowmeters connected to ship diesel engines for more than a year. Data is captured at 1 minute intervals and is presented to me in comma delimited .csv format. I am using python/pandas/jupyter notebooks to analyse the data. Here is an example of data read into a juypter notebook.

import pandas as pd

df = pd.read_csv('C:/Users/NETE/Desktop/Test.csv', parse_dates=['Date_Time'], index_col='Date_Time')
df

Out[1]: 
                     Fuel Consumption
Date_Time                            
2022-01-13 15:30:00              10.3
2022-01-13 15:31:00              10.5
2022-01-13 15:32:00              10.6
2022-01-13 15:33:00              10.3
2022-01-13 15:34:00              10.5
2022-01-13 15:35:00              10.4
2022-01-13 16:35:00               0.1
2022-01-13 16:36:00              20.5
2022-01-13 16:37:00               6.0
2022-01-13 16:38:00              18.0
2022-01-13 17:15:00              15.3
2022-01-13 17:16:00              15.7
2022-01-13 17:17:00              15.7
2022-01-13 17:18:00              15.4
2022-01-13 17:19:00              15.5
2022-01-13 17:20:00              15.7
2022-01-13 22:03:00              26.8
2022-01-13 22:04:00              27.1

I want to be able to group the data by timestamp interval so that each block of data, where timestamps are only 1 minute apart, group together. Once a timedelta greater than a minute is encountered a new block should be created. Ultimately I want to use the .value_counts() method on the groups so that I can drop groups less than 5 minutes duration.

I have considered .diff(), .groupby() and pd.Grouper methods but I am a python (and coding in general) novice and don't know where to go with this. I have spent many hours researching the problem to no avail. Any guidance would be most appreciated.

CodePudding user response：

The trick is to compute time differences between each row. For that you can use shift() method. When you know the differences, you can simply decide if the difference meets a predefined condition - in your case, if the difference is greater than 1 minute. This operation makes a boolean feature. Applying cumsum() on top of that makes group identifiers you are looking for.

This code

# import libraries
import pandas as pd
import numpy as np

# read the data
df = pd.read_csv('./data.csv')
df.columns = ['dt', 'consumption']

# do the processing
df['dt'] = pd.to_datetime(df.dt)
df['dt_diff_in_min'] = (df.dt - df.dt.shift()) / np.timedelta64(1, 'm')
df['group_id'] = df.dt_diff_in_min.gt(1).cumsum()

# show the result
print(df)

returns the following output.

                    dt  consumption  dt_diff_in_min  group_id
0  2022-01-13 15:30:00         10.3             NaN         0
1  2022-01-13 15:31:00         10.5             1.0         0
2  2022-01-13 15:32:00         10.6             1.0         0
3  2022-01-13 15:33:00         10.3             1.0         0
4  2022-01-13 15:34:00         10.5             1.0         0
5  2022-01-13 15:35:00         10.4             1.0         0
6  2022-01-13 16:35:00          0.1            60.0         1
7  2022-01-13 16:36:00         20.5             1.0         1
8  2022-01-13 16:37:00          6.0             1.0         1
9  2022-01-13 16:38:00         18.0             1.0         1
10 2022-01-13 17:15:00         15.3            37.0         2
11 2022-01-13 17:16:00         15.7             1.0         2
12 2022-01-13 17:17:00         15.7             1.0         2
13 2022-01-13 17:18:00         15.4             1.0         2
14 2022-01-13 17:19:00         15.5             1.0         2
15 2022-01-13 17:20:00         15.7             1.0         2
16 2022-01-13 22:03:00         26.8           283.0         3
17 2022-01-13 22:04:00         27.1             1.0         3