How to drop all rows below string match dataframe-CodePudding

I have a dataframe and I interested only the data above string text = "purchase" by session. input dataframe

session	Date	action	flag_purchase
T001	01-01-2021 00.01	click	1
T001	01-01-2021 00.15	play	1
T001	01-01-2021 02.15	pause	1
T001	01-01-2021 03.15	play	1
T001	01-01-2021 04.15	purchase	1
T001	02-01-2021 10.15	play	1
T001	02-01-2021 12.00	pause	1
T001	02-01-2021 13.15	play	1
T002	01-01-2021 00.01	play	0
T002	03-01-2021 00.15	play	0
T002	03-01-2021 02.15	pause	0
T002	03-01-2021 03.15	play	0

I want to drop all rows below action = "purchase", If all actions in session doesn’t have text match the session will keep all rows, so the output that I want looks like:

final result

session	Date	action	flag_purchase
T001	01-01-2021 00.01	click	1
T001	01-01-2021 00.15	play	1
T001	01-01-2021 02.15	pause	1
T001	01-01-2021 03.15	play	1
T001	01-01-2021 04.15	purchase	1
T002	01-01-2021 00.01	play	0
T002	03-01-2021 00.15	play	0
T002	03-01-2021 02.15	pause	0
T002	03-01-2021 03.15	play	0

CodePudding user response：

Try:

to_remove = lambda x: ~x.shift().eq('purchase').cumsum().astype(bool)
out = df[df.groupby('session')['action'].apply(to_remove)]
print(out)

# Output
   session              Date    action  flag_purchase
0     T001  01-01-2021 00.01     click              1
1     T001  01-01-2021 00.15      play              1
2     T001  01-01-2021 02.15     pause              1
3     T001  01-01-2021 03.15      play              1
4     T001  01-01-2021 04.15  purchase              1
8     T002  01-01-2021 00.01      play              0
9     T002  03-01-2021 00.15      play              0
10    T002  03-01-2021 02.15     pause              0
11    T002  03-01-2021 03.15      play              0

CodePudding user response：

If I understand you correctly, then you can do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({"id":[1,1,1,1,2,2,2,2,3,3],
"action":["pause","play","purchase","purchase","play","purchase","pause","play","play","pause"]})

print(df)

#   id  action
# 0  1  pause
# 1  1  play
# 2  1  purchase
# 3  1  purchase
# 4  2  play
# 5  2  purchase
# 6  2  pause
# 7  2  play
# 8  3  play
# 9  3  pause


def get_idx(row):
    """
    Gets the first index of where "purchase" occurs, then 
    return the rows untill and incl that index
    """

    idx = np.argwhere(row.values=="purchase") #get index
    if idx.size>0: #check if it exists
        idx = idx[0][0] 1
        return row[:idx] #return the rows
    return row #else, return the original rows

df_clean = df.groupby("id")["action"].apply(get_idx).reset_index(drop=False,level=0)

#    id action
# 0  1  pause
# 1  1  play
# 2  1  purchase
# 4  2  play
# 5  2  purchase
# 8  3  play
# 9  3  pause