I have a dataframe:
data = {'process': ['buying','selling','searhicng','repairing', 'preparing', 'selling','buying', 'searching', 'selling','searching'],
'type': ['in_progress','in_progress','end','in_progress', 'end', 'in_progress','in_progress', 'end', 'in_progress','end'],
'country': ['usa',np.nan, 'usa','ghana', np.nan,'end','portugal', np.nan, np.nan,'england'],
'id': ['022','022','022', '011','011', '011','011', '011', '011','011'],
'lag': ['00:00:10.042721','00:00:00.042721','00:00:05.042721','00:10:00.042721','00:00:00.042721','00:00:00.042721','00:00:50.042721','00:00:00.042721','00:00:00.042721','00:00:00.042721'],
'created': ['2021-07-01','2021-07-02','2021-07-03','2021-07-04','2021-07-05','2021-07-06','2021-07-06','2021-07-08','2021-07-09','2021-07-10'],
'next_created': ['2021-07-01','2021-07-02','2021-07-03','2021-07-04','2021-07-05','2021-07-06','2021-07-07','2021-07-08','2021-07-09','2021-07-10']
}
df = pd.DataFrame(data, columns = ['process','type','country', 'id','lag','created','next_created'])
I need to concatenate consecutive rows by the process column for each group by id, which have lag less than one second, write the value of the first row to created, and created_next the value of the last line.
Can anyone see the problem, i don't understand how i can use groupby in this situation.
I guess that i need to use cumsum(), but i don't know what i can use in the ??? place
df['lag'].shift(1).???.cumsum()
Output result
CodePudding user response:
You can try:
# Convert to timedelta to facilitate checking of within one second
df['lag'] = pd.to_timedelta(df['lag'])
# Grouping by `lag` difference is less than one second within the same `id`
group = df['lag'].diff().abs().gt(np.timedelta64(1, 's')).groupby(df['id']).cumsum()
# Group by `id` and newly created grouping and then aggregate
(df.groupby(['id', group], as_index=False, sort=False)
.agg({'process': lambda x: ' '.join(x), # concatenate consecutive rows within total grouping
'type': 'first',
'country': lambda x: x.iloc[0], # get first entry including `NaN`
'id': 'first',
'lag': 'first',
'created': 'first', # get first entry
'next_created': 'last' # get last entry
})
)
Result:
process type country id lag created next_created
0 buying in_progress usa 022 0 days 00:00:10.042721 2021-07-01 2021-07-01
1 selling in_progress NaN 022 0 days 00:00:00.042721 2021-07-02 2021-07-02
2 searhicng end usa 022 0 days 00:00:05.042721 2021-07-03 2021-07-03
3 repairing in_progress ghana 011 0 days 00:10:00.042721 2021-07-04 2021-07-04
4 preparing selling end NaN 011 0 days 00:00:00.042721 2021-07-05 2021-07-06
5 buying in_progress portugal 011 0 days 00:00:50.042721 2021-07-06 2021-07-07
6 searching selling searching end NaN 011 0 days 00:00:00.042721 2021-07-08 2021-07-10


