I have a dataframe of longitudinal data for subjects in a study. Each subject has repeated observations at a different time point Wave. One data point condition_A is expressed as a boolean, where all subjects have condition_A = False at Wave 1. Some subjects go on to develop condition_A = True at varying time points. Other subjects condition_A = False at all time points.
I am trying to find a way to exclude data from subjects with condition_A = True from the time point where condition_A is first True, while preserving any preceding data.
Data frame is set up as below:
>>> index = pd.MultiIndex.from_arrays([[1,1,1,2,2,2,3,3,3,4,4,4],[1,2,3]*4], names = ('SubjectID', 'Wave'))
>>> df = pd.DataFrame({"condition_A":[False, True, True, False, False, True, False, False, False, False, True, False]}, index=index)
>>> df
condition_A ...other columns...
SubjectID Wave
1 1 False
2 True
3 True
2 1 False
2 False
3 True
3 1 False
2 False
3 False
4 1 False
2 True
3 False
The desired output I want is:
condition_A ...other columns...
SubjectID Wave
1 1 False
2 1 False
2 False
3 1 False
2 False
3 False
One approach I tried was to use an external Series object to store the inclusion/exclusion status of each subject, then use apply on each row:
>>> df['condition_A_exclude'] = False
>>> subject_list = pd.Series(False, index=index.levels[0])
Int64Index([1, 2, 3], dtype='int64', name='SubjectID')
>>> def determine_condition_A(row):
... index = row['SubjectID']
... if subject_list.loc[index]:
... row['condition_A_exclude'] = True
... elif row['condition_A']:
... subject_list.loc[index] = True
... row['condition_A_exclude'] = True
... return row
>>> df.apply(determine_condition_A, axis = 'columns')
This approach works, but it relies on an external Series. I wondered if there was a way to accomplish this in place?
CodePudding user response:
To exclude all data (True or False), per SubjectID as soon as there is a True, you can use groupby cummax to compute a mask:
mask = df.groupby(level='SubjectID')['condition_A'].cummax()
df.loc[~mask]
output:
condition_A
SubjectID Wave
1 1 False
2 1 False
2 False
3 1 False
2 False
3 False
CodePudding user response:
I am not sure if I understand it correctly, but in your example data the condition gets True and stays True afterward. This means you simply want to exclude all rows with True as condition:
df[df['condition_A']==False]
Output:
SubjectID Wave
1 1 False
2 1 False
2 False
3 1 False
2 False
3 False
