I have the following dataframe:
MachineId Timestamp Events EventValue CycleEnd UniqueId
7 2021-11-01 20:45:17 Attr-A 50 0 0
7 2021-11-01 20:45:44 Attr-B 1000 0 0
7 2021-11-01 21:00:00 Attr-C 5 0 0
7 2021-11-01 21:03:36 End NULL 1 0
7 2021-11-01 21:11:43 Attr-B 1100 0 1
7 2021-11-01 21:11:44 Attr-C 2 0 1
7 2021-11-01 21:25:01 End NULL 1 1
The resulting dataframe should look like this:
MachineId CycleId CycleStarttime CycleEndtime Attr-A Attr-B Attr-C End
7 1 2021-11-01 20:45:17 2021-11-01 21:03:36 50 1000 5 NULL
7 2 2021-11-01 21:11:43 2021-11-01 21:25:01 NULL 1100 2 NULL
The CycleEnd column is related to the Events column. If it is 1, a cycle has completed.
I would like to pivot the dataframe sequentially at each cycle end, so that the values of Events become columns holding the corresponding values from the EventValue column.
My approach was to create a UniqueId column from the CycleEnd column using df['UniqueId'] = df['CycleEnd'].eq(1).shift().bfill().cumsum(), which labels each cycle individually. I then loop over each unique ID, filter the dataframe, store the required information in a temporary dataframe, and append it to a list. Finally, I concatenate the list.
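The cycle-labelling step described above can be sketched on the CycleEnd values from the sample table. This is only an illustration: shift(fill_value=False) is an assumed stand-in for the shift().bfill() pair, chosen so the first row never becomes NaN, and the variable names are hypothetical.

```python
import pandas as pd

# CycleEnd values copied from the sample table in the question
cycle_end = pd.Series([0, 0, 0, 1, 0, 0, 1], name="CycleEnd")

# A new cycle starts on the row AFTER an End row, so the end flag is
# shifted down by one row before taking the running total.
# Assumption: fill_value=False replaces shift().bfill() from the question.
unique_id = cycle_end.eq(1).shift(fill_value=False).cumsum()

print(unique_id.tolist())  # [0, 0, 0, 0, 1, 1, 1]
```

The labels match the UniqueId column shown in the table. With several machines in one frame, this global running total is only meaningful if the rows are sorted by machine and time and every machine's last row is an End row.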
Is there a more efficient (performance-wise) approach to this problem? The dataframe has hundreds of thousands of rows. Thank you!
CodePudding user response:
I think you can aggregate the minimum and maximum datetimes per group, then pivot with DataFrame.pivot_table using an aggregate function such as mean, sum, or last, and join the two results together:
df['UniqueId'] = df['CycleEnd'].eq(1).shift().bfill().cumsum().add(1)
df1 = (df.groupby(['MachineId','UniqueId'])
         .agg(CycleStarttime=('Timestamp','min'), CycleEndtime=('Timestamp','max')))
df2 = df.pivot_table(index=['MachineId','UniqueId'],
                     columns='Events',
                     values='EventValue',
                     aggfunc='sum')
df = df1.join(df2).rename_axis(['MachineId','CycleId']).reset_index()
print(df)
MachineId CycleId CycleStarttime CycleEndtime Attr-A Attr-B \
0 7 1 2021-11-01 20:45:17 2021-11-01 21:03:36 50.0 1000.0
1 7 2 2021-11-01 21:11:43 2021-11-01 21:25:01 NaN 1100.0
Attr-C End
0 5.0 NaN
1 2.0 NaN
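For anyone who wants to try the approach above without the original data, here is a self-contained sketch that rebuilds the sample table from the question and applies the same groupby plus pivot_table steps. It assumes the UniqueId column already exists as shown in the question and that NULL is stored as NaN. The names times, values and result are hypothetical. The sketch does not rely on what ends up in the End column, since that depends on how the aggregation treats all-missing groups.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; NULL is assumed to be NaN
df = pd.DataFrame({
    "MachineId": [7] * 7,
    "Timestamp": pd.to_datetime([
        "2021-11-01 20:45:17", "2021-11-01 20:45:44", "2021-11-01 21:00:00",
        "2021-11-01 21:03:36", "2021-11-01 21:11:43", "2021-11-01 21:11:44",
        "2021-11-01 21:25:01",
    ]),
    "Events": ["Attr-A", "Attr-B", "Attr-C", "End", "Attr-B", "Attr-C", "End"],
    "EventValue": [50, 1000, 5, np.nan, 1100, 2, np.nan],
    "CycleEnd": [0, 0, 0, 1, 0, 0, 1],
    "UniqueId": [0, 0, 0, 0, 1, 1, 1],
})

# 1-based cycle number derived from the existing UniqueId column
df["CycleId"] = df["UniqueId"] + 1
keys = ["MachineId", "CycleId"]

# First and last timestamp of every cycle
times = df.groupby(keys).agg(
    CycleStarttime=("Timestamp", "min"),
    CycleEndtime=("Timestamp", "max"),
)

# One column per event name, one row per cycle
values = df.pivot_table(index=keys, columns="Events",
                        values="EventValue", aggfunc="sum")

# Both frames share the (MachineId, CycleId) index, so a plain join aligns them
result = times.join(values).reset_index()
print(result)
```

Both steps are single vectorised calls over the whole frame, so no Python-level loop over cycles is needed.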
CodePudding user response:
Fortunately, your DataFrame has a UniqueId column, which makes grouping straightforward.
To do your task, define a function that processes one group of source rows:
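import pandas as pd

def grpProc(grp):
    # Cycle metadata: machine, 1-based cycle number, first and last timestamps
    rv1 = pd.Series([grp.MachineId.iloc[0], grp.UniqueId.iloc[0] + 1,
                     grp.Timestamp.iloc[0], grp.Timestamp.iloc[-1]],
                    index=['MachineId', 'CycleId', 'CycleStarttime', 'CycleEndtime'])
    # Pivot every row except the last one (the End row) into one column per event
    rv2 = grp[:-1].pivot(index='MachineId', columns='Events',
                         values='EventValue').iloc[0]
    return pd.concat([rv1, rv2])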
Then run:
wrk = df.groupby('UniqueId').apply(grpProc)
result = wrk.unstack().reindex(columns=wrk[0].index)
result.index.name = None
For your source data the result is:
MachineId CycleId CycleStarttime CycleEndtime Attr-A Attr-B Attr-C
0 7 1 2021-11-01 20:45:17 2021-11-01 21:03:36 50 1000 5
1 7 2 2021-11-01 21:11:43 2021-11-01 21:25:01 NaN 1100 2
