Home > Software engineering >  Group data by timedelta in python
Group data by timedelta in python

Time:01-17

I have been collecting fuel consumption data from flowmeters connected to ship diesel engines for more than a year. Data is captured at 1 minute intervals and is presented to me in comma delimited .csv format. I am using python/pandas/jupyter notebooks to analyse the data. Here is an example of data read into a juypter notebook.

import pandas as pd

df = pd.read_csv('C:/Users/NETE/Desktop/Test.csv', parse_dates=['Date_Time'], index_col='Date_Time')
df

Out[1]: 
                     Fuel Consumption
Date_Time                            
2022-01-13 15:30:00              10.3
2022-01-13 15:31:00              10.5
2022-01-13 15:32:00              10.6
2022-01-13 15:33:00              10.3
2022-01-13 15:34:00              10.5
2022-01-13 15:35:00              10.4
2022-01-13 16:35:00               0.1
2022-01-13 16:36:00              20.5
2022-01-13 16:37:00               6.0
2022-01-13 16:38:00              18.0
2022-01-13 17:15:00              15.3
2022-01-13 17:16:00              15.7
2022-01-13 17:17:00              15.7
2022-01-13 17:18:00              15.4
2022-01-13 17:19:00              15.5
2022-01-13 17:20:00              15.7
2022-01-13 22:03:00              26.8
2022-01-13 22:04:00              27.1

I want to be able to group the data by timestamp interval so that each block of data, where timestamps are only 1 minute apart, group together. Once a timedelta greater than a minute is encountered a new block should be created. Ultimately I want to use the .value_counts() method on the groups so that I can drop groups less than 5 minutes duration.

I have considered .diff(), .groupby() and pd.Grouper methods but I am a python (and coding in general) novice and don't know where to go with this. I have spent many hours researching the problem to no avail. Any guidance would be most appreciated.

CodePudding user response:

The trick is to compute time differences between each row. For that you can use shift() method. When you know the differences, you can simply decide if the difference meets a predefined condition - in your case, if the difference is greater than 1 minute. This operation makes a boolean feature. Applying cumsum() on top of that makes group identifiers you are looking for.

This code

# import libraries
import pandas as pd
import numpy as np

# read the data
df = pd.read_csv('./data.csv')
df.columns = ['dt', 'consumption']

# do the processing
df['dt'] = pd.to_datetime(df.dt)
df['dt_diff_in_min'] = (df.dt - df.dt.shift()) / np.timedelta64(1, 'm')
df['group_id'] = df.dt_diff_in_min.gt(1).cumsum()

# show the result
print(df)

returns the following output.

                    dt  consumption  dt_diff_in_min  group_id
0  2022-01-13 15:30:00         10.3             NaN         0
1  2022-01-13 15:31:00         10.5             1.0         0
2  2022-01-13 15:32:00         10.6             1.0         0
3  2022-01-13 15:33:00         10.3             1.0         0
4  2022-01-13 15:34:00         10.5             1.0         0
5  2022-01-13 15:35:00         10.4             1.0         0
6  2022-01-13 16:35:00          0.1            60.0         1
7  2022-01-13 16:36:00         20.5             1.0         1
8  2022-01-13 16:37:00          6.0             1.0         1
9  2022-01-13 16:38:00         18.0             1.0         1
10 2022-01-13 17:15:00         15.3            37.0         2
11 2022-01-13 17:16:00         15.7             1.0         2
12 2022-01-13 17:17:00         15.7             1.0         2
13 2022-01-13 17:18:00         15.4             1.0         2
14 2022-01-13 17:19:00         15.5             1.0         2
15 2022-01-13 17:20:00         15.7             1.0         2
16 2022-01-13 22:03:00         26.8           283.0         3
17 2022-01-13 22:04:00         27.1             1.0         3
  •  Tags:  
  • Related