I have been collecting fuel consumption data from flowmeters connected to ship diesel engines for more than a year. Data is captured at 1 minute intervals and is presented to me in comma delimited .csv format. I am using python/pandas/jupyter notebooks to analyse the data. Here is an example of data read into a juypter notebook.
import pandas as pd
df = pd.read_csv('C:/Users/NETE/Desktop/Test.csv', parse_dates=['Date_Time'], index_col='Date_Time')
df
Out[1]:
Fuel Consumption
Date_Time
2022-01-13 15:30:00 10.3
2022-01-13 15:31:00 10.5
2022-01-13 15:32:00 10.6
2022-01-13 15:33:00 10.3
2022-01-13 15:34:00 10.5
2022-01-13 15:35:00 10.4
2022-01-13 16:35:00 0.1
2022-01-13 16:36:00 20.5
2022-01-13 16:37:00 6.0
2022-01-13 16:38:00 18.0
2022-01-13 17:15:00 15.3
2022-01-13 17:16:00 15.7
2022-01-13 17:17:00 15.7
2022-01-13 17:18:00 15.4
2022-01-13 17:19:00 15.5
2022-01-13 17:20:00 15.7
2022-01-13 22:03:00 26.8
2022-01-13 22:04:00 27.1
I want to be able to group the data by timestamp interval so that each block of data, where timestamps are only 1 minute apart, group together. Once a timedelta greater than a minute is encountered a new block should be created. Ultimately I want to use the .value_counts() method on the groups so that I can drop groups less than 5 minutes duration.
I have considered .diff(), .groupby() and pd.Grouper methods but I am a python (and coding in general) novice and don't know where to go with this. I have spent many hours researching the problem to no avail. Any guidance would be most appreciated.
CodePudding user response:
The trick is to compute time differences between each row. For that you can use shift() method. When you know the differences, you can simply decide if the difference meets a predefined condition - in your case, if the difference is greater than 1 minute. This operation makes a boolean feature. Applying cumsum() on top of that makes group identifiers you are looking for.
This code
# import libraries
import pandas as pd
import numpy as np
# read the data
df = pd.read_csv('./data.csv')
df.columns = ['dt', 'consumption']
# do the processing
df['dt'] = pd.to_datetime(df.dt)
df['dt_diff_in_min'] = (df.dt - df.dt.shift()) / np.timedelta64(1, 'm')
df['group_id'] = df.dt_diff_in_min.gt(1).cumsum()
# show the result
print(df)
returns the following output.
dt consumption dt_diff_in_min group_id
0 2022-01-13 15:30:00 10.3 NaN 0
1 2022-01-13 15:31:00 10.5 1.0 0
2 2022-01-13 15:32:00 10.6 1.0 0
3 2022-01-13 15:33:00 10.3 1.0 0
4 2022-01-13 15:34:00 10.5 1.0 0
5 2022-01-13 15:35:00 10.4 1.0 0
6 2022-01-13 16:35:00 0.1 60.0 1
7 2022-01-13 16:36:00 20.5 1.0 1
8 2022-01-13 16:37:00 6.0 1.0 1
9 2022-01-13 16:38:00 18.0 1.0 1
10 2022-01-13 17:15:00 15.3 37.0 2
11 2022-01-13 17:16:00 15.7 1.0 2
12 2022-01-13 17:17:00 15.7 1.0 2
13 2022-01-13 17:18:00 15.4 1.0 2
14 2022-01-13 17:19:00 15.5 1.0 2
15 2022-01-13 17:20:00 15.7 1.0 2
16 2022-01-13 22:03:00 26.8 283.0 3
17 2022-01-13 22:04:00 27.1 1.0 3
