I have a dataframe and I interested only the data above string text = "purchase" by session. input dataframe
| session | Date | action | flag_purchase |
|---|---|---|---|
| T001 | 01-01-2021 00.01 | click | 1 |
| T001 | 01-01-2021 00.15 | play | 1 |
| T001 | 01-01-2021 02.15 | pause | 1 |
| T001 | 01-01-2021 03.15 | play | 1 |
| T001 | 01-01-2021 04.15 | purchase | 1 |
| T001 | 02-01-2021 10.15 | play | 1 |
| T001 | 02-01-2021 12.00 | pause | 1 |
| T001 | 02-01-2021 13.15 | play | 1 |
| T002 | 01-01-2021 00.01 | play | 0 |
| T002 | 03-01-2021 00.15 | play | 0 |
| T002 | 03-01-2021 02.15 | pause | 0 |
| T002 | 03-01-2021 03.15 | play | 0 |
I want to drop all rows below action = "purchase", If all actions in session doesn’t have text match the session will keep all rows, so the output that I want looks like:
| session | Date | action | flag_purchase |
|---|---|---|---|
| T001 | 01-01-2021 00.01 | click | 1 |
| T001 | 01-01-2021 00.15 | play | 1 |
| T001 | 01-01-2021 02.15 | pause | 1 |
| T001 | 01-01-2021 03.15 | play | 1 |
| T001 | 01-01-2021 04.15 | purchase | 1 |
| T002 | 01-01-2021 00.01 | play | 0 |
| T002 | 03-01-2021 00.15 | play | 0 |
| T002 | 03-01-2021 02.15 | pause | 0 |
| T002 | 03-01-2021 03.15 | play | 0 |
CodePudding user response:
Try:
to_remove = lambda x: ~x.shift().eq('purchase').cumsum().astype(bool)
out = df[df.groupby('session')['action'].apply(to_remove)]
print(out)
# Output
session Date action flag_purchase
0 T001 01-01-2021 00.01 click 1
1 T001 01-01-2021 00.15 play 1
2 T001 01-01-2021 02.15 pause 1
3 T001 01-01-2021 03.15 play 1
4 T001 01-01-2021 04.15 purchase 1
8 T002 01-01-2021 00.01 play 0
9 T002 03-01-2021 00.15 play 0
10 T002 03-01-2021 02.15 pause 0
11 T002 03-01-2021 03.15 play 0
CodePudding user response:
If I understand you correctly, then you can do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({"id":[1,1,1,1,2,2,2,2,3,3],
"action":["pause","play","purchase","purchase","play","purchase","pause","play","play","pause"]})
print(df)
# id action
# 0 1 pause
# 1 1 play
# 2 1 purchase
# 3 1 purchase
# 4 2 play
# 5 2 purchase
# 6 2 pause
# 7 2 play
# 8 3 play
# 9 3 pause
def get_idx(row):
"""
Gets the first index of where "purchase" occurs, then
return the rows untill and incl that index
"""
idx = np.argwhere(row.values=="purchase") #get index
if idx.size>0: #check if it exists
idx = idx[0][0] 1
return row[:idx] #return the rows
return row #else, return the original rows
df_clean = df.groupby("id")["action"].apply(get_idx).reset_index(drop=False,level=0)
# id action
# 0 1 pause
# 1 1 play
# 2 1 purchase
# 4 2 play
# 5 2 purchase
# 8 3 play
# 9 3 pause
