My dataset looks like this:
main_id time_stamp
aaa 2019-05-29 08:16:05+05
aaa 2019-05-30 00:11:05+05
aaa 2020-05-30 09:15:07+05
bbb 2019-05-29 09:11:05+05
For each main_id, I want to:
a) sort the time_stamps in ascending order
b) create a new column, day, that uses the time_stamp to derive the number of the business day.
Business days are defined like this:
Monday 05:00 - Tuesday 01:00 => 1 business day (i.e. Monday)
Tuesday 05:00 - Wednesday 01:00 => 1 business day (i.e. Tuesday)
and so on...
The first and second rows with main_id = aaa are from the same business day, since the second row's time is before 1 am on the next day. This is the very first business day, so the day column would have 1.
The timestamp in the third row, however, is from another business day, so its day is 2.
The end result could look something like this:
main_id time_stamp day
aaa 2019-05-29 08:16:05+05 1
aaa 2019-05-30 00:11:05+05 1
aaa 2020-05-30 09:15:07+05 2
bbb 2019-05-29 09:11:05+05 1
Day 1 would be anywhere between the first 5:00 am and the next day's 1:00 am, while day 2 would be the next possible business day (the next 5:00 am - 1:00 am).
How can I achieve this?
CodePudding user response:
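- To sort the timestamps in ascending order within each main_id, do this:
import pandas as pd
# Let's say the dataframe is df
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
# sort_values returns a new DataFrame, so assign the result back
df = df.sort_values(by=['main_id', 'time_stamp'])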
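- For the business days, I would do this:
import numpy as np
# End of the first business day, e.g. 2019-05-30 01:00, in the same timezone as the column
day1 = pd.Timestamp('2019-05-30 01:00', tz=df['time_stamp'].dt.tz)
# Days elapsed past day1, rounded up: rows in the first business day get 1, the next 24 hours get 2, and so on.
# Note that this counts elapsed days, so a row one year later gets a large number rather than 2.
df['day'] = np.ceil((df['time_stamp'] - day1) / pd.Timedelta(days=1)).astype(int) + 1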
CodePudding user response:
A simple method is to subtract 5 hours, so that each business day (05:00 to 01:00) falls within a single calendar date, and then to group by those dates to get the group number:
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
s = df['time_stamp'].sub(pd.Timedelta('5h'))
df['day'] = df.groupby(s.dt.date).ngroup().add(1)
NB. You don't actually need to sort the values first: groupby sorts the group keys by default, so ngroup numbers the dates in ascending order.
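To see why the 5-hour shift works, here is a minimal standard-library sketch. The helper names business_date and in_window are made up for illustration, the timestamps are naive (the UTC offset is dropped), and treating 01:00 itself as outside the window is an assumption, since the question does not say which side the boundary belongs to.

```python
from datetime import datetime, time, timedelta

SHIFT = timedelta(hours=5)  # a business day starts at 05:00

def business_date(ts):
    """Calendar date of the business day that ts is assigned to."""
    return (ts - SHIFT).date()

def in_window(ts):
    """True if ts lies in 05:00-01:00 (assumed half-open: 01:00 itself is out)."""
    # After the shift, the 20-hour window 05:00-01:00 becomes 00:00-20:00.
    return (ts - SHIFT).time() < time(20, 0)

after_midnight = datetime(2019, 5, 30, 0, 11, 5)
morning = datetime(2019, 5, 30, 9, 15, 7)
in_gap = datetime(2019, 5, 30, 3, 0, 0)

print(business_date(after_midnight), in_window(after_midnight))  # 2019-05-29 True
print(business_date(morning), in_window(morning))                # 2019-05-30 True
print(business_date(in_gap), in_window(in_gap))                  # 2019-05-29 False
```

A timestamp between 01:00 and 05:00 lies outside every window defined in the question. The plain shift still assigns it to the previous business day, so filter such rows out first if they should not be counted.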
Variant to apply per "main_id":
df['day'] = (df.groupby('main_id')
               .apply(lambda d: d.groupby(s.dt.date).ngroup().add(1))
               .droplevel(0)
            )
Output:
main_id time_stamp day
0 aaa 2019-05-29 08:16:05+05:00 1
1 aaa 2019-05-30 00:11:05+05:00 1
2 aaa 2020-05-30 09:15:07+05:00 2
3 bbb 2019-05-29 09:11:05+05:00 1
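For comparison, the same per-main_id numbering can be sketched without pandas. This is only an illustration of the idea, not the answer's code: number_business_days is a hypothetical helper, and the timestamps are written as naive datetimes on the assumption that all rows share the same UTC offset.

```python
from datetime import datetime, timedelta

SHIFT = timedelta(hours=5)  # a business day starts at 05:00

rows = [
    ("aaa", datetime(2019, 5, 29, 8, 16, 5)),
    ("aaa", datetime(2019, 5, 30, 0, 11, 5)),
    ("aaa", datetime(2020, 5, 30, 9, 15, 7)),
    ("bbb", datetime(2019, 5, 29, 9, 11, 5)),
]

def number_business_days(rows):
    """Return (main_id, time_stamp, day) tuples sorted by id, then time."""
    # Collect the distinct shifted dates seen for each main_id.
    dates_by_id = {}
    for main_id, ts in rows:
        dates_by_id.setdefault(main_id, set()).add((ts - SHIFT).date())
    # Rank each id's dates in ascending order, starting at 1.
    rank = {
        main_id: {d: i for i, d in enumerate(sorted(dates), start=1)}
        for main_id, dates in dates_by_id.items()
    }
    return [
        (main_id, ts, rank[main_id][(ts - SHIFT).date()])
        for main_id, ts in sorted(rows)
    ]

result = number_business_days(rows)
for main_id, ts, day in result:
    print(main_id, ts, day)
```

On the sample data this gives days 1, 1, 2 for aaa and 1 for bbb, matching the output above.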
