Pandas: Censoring subjects in longitudinal data based on criteria in another column-CodePudding

I have a dataframe of longitudinal data for subjects in a study. Each subject has repeated observations at a different time point Wave. One data point condition_A is expressed as a boolean, where all subjects have condition_A = False at Wave 1. Some subjects go on to develop condition_A = True at varying time points. Other subjects condition_A = False at all time points.

I am trying to find a way to exclude data from subjects with condition_A = True from the time point where condition_A is first True, while preserving any preceding data.

Data frame is set up as below:

>>> index = pd.MultiIndex.from_arrays([[1,1,1,2,2,2,3,3,3,4,4,4],[1,2,3]*4], names = ('SubjectID', 'Wave'))
>>> df = pd.DataFrame({"condition_A":[False, True, True, False, False, True, False, False, False, False, True, False]}, index=index)

>>> df
                condition_A    ...other columns...
SubjectID Wave
1         1           False
          2            True
          3            True
2         1           False
          2           False
          3            True
3         1           False
          2           False
          3           False
4         1           False
          2            True
          3           False

The desired output I want is:

                condition_A    ...other columns...
SubjectID Wave
1         1           False
2         1           False
          2           False
3         1           False
          2           False
          3           False

One approach I tried was to use an external Series object to store the inclusion/exclusion status of each subject, then use apply on each row:

>>> df['condition_A_exclude'] = False
>>> subject_list = pd.Series(False, index=index.levels[0])
Int64Index([1, 2, 3], dtype='int64', name='SubjectID')
>>> def determine_condition_A(row):
...     index = row['SubjectID']
...     if subject_list.loc[index]:
...         row['condition_A_exclude'] = True
...     elif row['condition_A']:
...          subject_list.loc[index] = True
...          row['condition_A_exclude'] = True
...     return row

>>> df.apply(determine_condition_A, axis = 'columns')

This approach works, but it relies on an external Series. I wondered if there was a way to accomplish this in place?

CodePudding user response：

To exclude all data (True or False), per SubjectID as soon as there is a True, you can use groupby cummax to compute a mask:

mask = df.groupby(level='SubjectID')['condition_A'].cummax()

df.loc[~mask]

output:

                condition_A
SubjectID Wave             
1         1           False
2         1           False
          2           False
3         1           False
          2           False
          3           False

CodePudding user response：

I am not sure if I understand it correctly, but in your example data the condition gets True and stays True afterward. This means you simply want to exclude all rows with True as condition:

df[df['condition_A']==False]

Output:

SubjectID   Wave    
1   1   False
2   1   False
    2   False
3   1   False
    2   False
    3   False