I am looking to remove END from 'Task' column if there is no START before it. The data can be grouped by 'Session' and if the first occurance of the 'Task' is END, then I want to replace that specific occurrence with nAn value.
| Session | Task | |
|---|---|---|
| 0 | 1 | |
| 1 | 1 | END |
| 2 | 1 | |
| 3 | 1 | START |
| 4 | 1 | |
| 5 | 1 | END |
| 6 | 2 | |
| 7 | 2 | START |
| 8 | 2 | |
| 9 | 2 | END |
| 10 | 2 | |
| 11 | 2 | |
| 12 | 3 | |
| 13 | 3 | START |
| 14 | 3 | |
| 15 | 3 | |
| 16 | 4 | |
| 17 | 4 | START |
| 18 | 4 | |
| 18 | 4 | |
| 18 | 4 | END |
the DataFrame
import pandas as pd
d = {'Session':[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4],
'Task':['', 'END', '', 'START', '', 'END', '', 'START', '', 'END', '', '', '', 'START', '', '', '', 'START', '', '', 'END']}
df = pd.DataFrame(data=d)
My initial thought was to get the first occurrence of 'Task' for each group, in a different data frame df2, and filter only rows with 'END' value, and then use index of df2 to remove the value from the original df.
Below is the expected table
| Session | Task | |
|---|---|---|
| 0 | 1 | |
| 1 | 1 | |
| 2 | 1 | |
| 3 | 1 | START |
| 4 | 1 | |
| 5 | 1 | END |
| 6 | 2 | |
| 7 | 2 | START |
| 8 | 2 | |
| 9 | 2 | END |
| 10 | 2 | |
| 11 | 2 | |
| 12 | 3 | |
| 13 | 3 | START |
| 14 | 3 | |
| 15 | 3 | |
| 16 | 4 | |
| 17 | 4 | START |
| 18 | 4 | |
| 18 | 4 | |
| 18 | 4 | END |
CodePudding user response:
Using Dataframe apply with a custom function to remove unmatched "END"
Code
def remove_unmatched(x):
' removes "END" value when not preceeded by "START" '
preceeded = False
result = []
for z in x:
if z == "START":
preceeded = True # Set preceeded to True since found a start
result.append(z)
elif z == "END":
result.append(z if preceeded else "") # "END" or "" based upon whether preceede by "START"
preceeded = False
else:
result.append(z) # value lunchanged
return pd.Series(result, index = x.index) # new series
df['Task'] = df.groupby('Session')['Task'].apply(remove_unmatched) # provides desired df
CodePudding user response:
import numpy as np
idx = df.where(df.groupby('Session').first().eq('END'))['Task'] == 'END'
df.loc[idx, 'Task'] = np.nan
How does it work?
Basically, you group per session and take the first non NaN value which is equal to END. That is, that session starts with an END. Then select all rows in the dataframe that matches that, for the Task column. It's necessary to add == 'END' such that there's a boolean column for indexing.
Then you can just locate those cases in the Task column and assign NaN to them.
