How to pivot and concatenate sequence of repeating rows in pandas dataframe?

I have the following dataframe:

MachineId     Timestamp                Events    EventValue     CycleEnd       UniqueId            
7             2021-11-01 20:45:17      Attr-A       50           0                0
7             2021-11-01 20:45:44      Attr-B       1000         0                0
7             2021-11-01 21:00:00      Attr-C       5            0                0
7             2021-11-01 21:03:36      End          NULL         1                0
7             2021-11-01 21:11:43      Attr-B       1100         0                1
7             2021-11-01 21:11:44      Attr-C       2            0                1
7             2021-11-01 21:25:01      End          NULL         1                1

The resulting dataframe should look like this:

MachineId    CycleId    CycleStarttime         CycleEndtime          Attr-A  Attr-B  Attr-C   End  
7            1          2021-11-01 20:45:17    2021-11-01 21:03:36   50      1000    5        NULL
7            2          2021-11-01 21:11:43    2021-11-01 21:25:01   NULL    1100    2        NULL

The CycleEnd column is related to the Events column. If it is 1, a cycle has completed. I would like to pivot the dataframe sequentially for each cycle end, so that the values of Events become columns holding the corresponding values from the EventValue column.

My approach was to create a UniqueId column from the CycleEnd column using df['UniqueId'] = df['CycleEnd'].eq(1).shift().bfill().cumsum(), which labels each cycle individually. I then loop over each unique ID, filter the dataframe, store the required information in a temporary dataframe, and append it to a list. Finally, I concatenate the list.
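To make the cycle labelling concrete, here is a minimal sketch on the CycleEnd flags from the sample above. Note that it uses shift(fill_value=0) instead of the .eq(1).shift().bfill() chain; that is a swapped-in variant, not the exact expression from the question, and it assumes the first row is not itself a cycle end.

```python
import pandas as pd

# CycleEnd flags copied from the sample data in the question.
cycle_end = pd.Series([0, 0, 0, 1, 0, 0, 1], name="CycleEnd")

# Shift the flags down one row so each "End" row stays in the cycle it closes,
# then take a running total: the counter increases on the row after each end.
unique_id = cycle_end.shift(fill_value=0).cumsum()

print(unique_id.tolist())  # [0, 0, 0, 0, 1, 1, 1]
```

The labels match the UniqueId column shown in the sample, so the rest of the processing can group on them.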

I would like to know if there is a more efficient (performance-wise) approach to this problem. The dataframe has hundreds of thousands of rows. Thank you!

CodePudding user response:

I think you can aggregate the minimum and maximum datetimes per group, pivot with DataFrame.pivot_table using an aggregate function such as mean, sum or last, and then join the two results:

# Label each cycle, starting from 1
df['UniqueId'] = df['CycleEnd'].eq(1).shift().bfill().cumsum().add(1)

# First and last timestamp of every cycle
df1 = (df.groupby(['MachineId','UniqueId'])
         .agg(CycleStarttime=('Timestamp','min'), CycleEndtime=('Timestamp','max')))
# One column per event, filled from EventValue
df2 = df.pivot_table(index=['MachineId','UniqueId'],
                     columns='Events',
                     values='EventValue',
                     aggfunc='sum')

df  = df1.join(df2).rename_axis(['MachineId','CycleId']).reset_index()
print(df)

   MachineId  CycleId      CycleStarttime        CycleEndtime  Attr-A  Attr-B  \
0          7        1 2021-11-01 20:45:17 2021-11-01 21:03:36    50.0  1000.0   
1          7        2 2021-11-01 21:11:43 2021-11-01 21:25:01     NaN  1100.0   

   Attr-C  End  
0     5.0  NaN  
1     2.0  NaN
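For anyone who wants to try this end to end, below is a self-contained sketch that rebuilds the sample from the question and applies the same groupby plus pivot_table idea. The variable names (times, values, result) are my own, the cycle label is computed with shift(fill_value=0) rather than the expression above, and aggfunc='first' is used instead of 'sum' on the assumption that each event occurs at most once per cycle ('sum' would add repeated values together).

```python
import numpy as np
import pandas as pd

# Sample data from the question; NULL is represented as NaN.
df = pd.DataFrame({
    "MachineId": [7] * 7,
    "Timestamp": pd.to_datetime([
        "2021-11-01 20:45:17", "2021-11-01 20:45:44", "2021-11-01 21:00:00",
        "2021-11-01 21:03:36", "2021-11-01 21:11:43", "2021-11-01 21:11:44",
        "2021-11-01 21:25:01",
    ]),
    "Events": ["Attr-A", "Attr-B", "Attr-C", "End", "Attr-B", "Attr-C", "End"],
    "EventValue": [50, 1000, 5, np.nan, 1100, 2, np.nan],
    "CycleEnd": [0, 0, 0, 1, 0, 0, 1],
})

# 1-based cycle label: the counter increases on the row after each cycle end.
df["CycleId"] = df["CycleEnd"].shift(fill_value=0).cumsum() + 1

keys = ["MachineId", "CycleId"]

# Start and end time of every cycle.
times = df.groupby(keys).agg(
    CycleStarttime=("Timestamp", "min"),
    CycleEndtime=("Timestamp", "max"),
)

# One column per event; 'first' keeps the first non-null value in each cell.
values = df.pivot_table(index=keys, columns="Events",
                        values="EventValue", aggfunc="first")

result = times.join(values).reset_index()
print(result)
```

Everything here is vectorised, so no Python-level loop over the cycles is needed, which is the part that tends to get slow with hundreds of thousands of rows.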

CodePudding user response:

Fortunately, your DataFrame already has a UniqueId column, which makes grouping easy.

To do your task, define a function that processes each group of source rows:

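import pandas as pd

def grpProc(grp):
    # Cycle-level fields: machine, 1-based cycle number, first and last timestamps
    rv1 = pd.Series([grp.MachineId.iloc[0], grp.UniqueId.iloc[0] + 1,
        grp.Timestamp.iloc[0], grp.Timestamp.iloc[-1]],
        index=['MachineId', 'CycleId', 'CycleStarttime', 'CycleEndtime'])
    # Pivot every row except the closing 'End' row into one column per event
    rv2 = grp[:-1].pivot(index='MachineId', columns='Events',
        values='EventValue').iloc[0]
    return pd.concat([rv1, rv2])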

Then run:

wrk = df.groupby('UniqueId').apply(grpProc)
result = wrk.unstack().reindex(columns=wrk[0].index)
result.index.name=None

For your source data the result is:

  MachineId CycleId      CycleStarttime        CycleEndtime Attr-A Attr-B  Attr-C
0         7       1 2021-11-01 20:45:17 2021-11-01 21:03:36     50   1000       5 
1         7       2 2021-11-01 21:11:43 2021-11-01 21:25:01    NaN   1100       2