Custom train-test split using two stratified classes

I am trying to use NLP techniques to predict the mental status of patients at the time the doctor's notes were taken. I have two class columns: suffix (multilabel) and mental_status (binary, Lucid or Delirium). I am trying to build train and test sets from a reduced list that meets the following specifications:

Both sets share the same list of unique suffixes
In each set, every suffix has at least one Lucid row and at least one Delirium row

Any imbalances I will correct with sample weights (I am only trying to predict mental_status, but I need the data to also be balanced on suffix). I just need at least one row for each combination within the full Cartesian product of suffixes and mental_statuses.

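For example, in the dataset below, only the DWN Communication rows would make it into the train-test split, because that is the only suffix with at least two rows for each mental_status (one for train, one for test).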

suffix              mental_status
DWN Communication   Lucid
DWN Communication   Delirium
DWN Communication   Lucid
DWN Communication   Delirium
DWN Communication   Lucid
DWN Psychiatry      Lucid
DWN Psychiatry      Delirium
DWN Psychiatry      Delirium
DWN Cardio stuff    Lucid
DWN Cardio stuff    Delirium
DWN Blood .....     Lucid
DWN Blood .....     Lucid
DWN Blood .....     Lucid

The output I would want for the example above for train would be

DWN Communication   Lucid
DWN Communication   Delirium
DWN Communication   Lucid

and test

DWN Communication   Lucid
DWN Communication   Delirium

I've tried

df.groupby(['suffix','mental_status']).filter(lambda x: len(x) > 1)

but it doesn't ensure there is one of each Delirium and Lucid for each suffix.

As another example,

    suf     ms
11  blood   delirium
15  blood   delirium
0   blood   lucid
1   blood   lucid
5   blood   lucid
6   blood   lucid
10  blood   lucid
16  blood   lucid
3   psych   delirium
19  psych   delirium
4   psych   lucid
8   psych   lucid
9   psych   lucid
13  psych   lucid
14  psych   lucid
18  psych   lucid
7   stool   delirium
2   stool   lucid
12  stool   lucid
17  stool   lucid

would be split into

    suf     ms
11  blood   delirium
0   blood   lucid
1   blood   lucid
5   blood   lucid
6   blood   lucid
3   psych   delirium
4   psych   lucid
8   psych   lucid
9   psych   lucid
13  psych   lucid

and

    suf     ms
15  blood   delirium
10  blood   lucid
16  blood   lucid
19  psych   delirium
14  psych   lucid
18  psych   lucid

CodePudding user response：

Does it work if you just do df.drop_duplicates(inplace=True)? It looks like that’s all there is to it because you just want the unique values based on the combination of both columns.

CodePudding user response：

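Use a boolean mask to filter:
# keep rows whose suffix contains "DWN" (na=False treats missing suffixes as non-matches)
mask = df['suffix'].str.contains("DWN", na=False)
df = df[mask].dropna()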

CodePudding user response：

What about using the index and duplicated(keep=False)?

test = df[~df.duplicated(keep=False)]
output  = df.drop_duplicates()
output = output[~output.index.isin(test.index)]
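
For the requirement as stated in the question (every kept suffix needs at least one row of each mental_status in both train and test), here is a minimal sketch of one possible approach, not a definitive implementation. It assumes the DataFrame index is unique and that a suffix qualifies only when each status has at least two rows. The helper name split_by_suffix_and_status and the parameters test_frac and seed are made up for illustration. The sample data mirrors the counts in the second example from the question.

```python
import pandas as pd


def split_by_suffix_and_status(df, suffix_col="suffix", status_col="mental_status",
                               test_frac=0.3, seed=0):
    """Hypothetical helper: keep suffixes with >= 2 rows for every status,
    then split each (suffix, status) group so both sets get at least one row.
    Assumes df has a unique index."""
    # Rows per (suffix, status) pair; pairs that never occur become 0.
    counts = df.groupby([suffix_col, status_col]).size().unstack(fill_value=0)
    # A suffix qualifies only if every status has at least 2 rows (one per set).
    keep = counts.index[(counts >= 2).all(axis=1)]
    kept = df[df[suffix_col].isin(keep)]

    test_idx = []
    for _, group in kept.groupby([suffix_col, status_col]):
        # At least 1 test row, and at least 1 row left over for train.
        n_test = min(max(1, round(len(group) * test_frac)), len(group) - 1)
        test_idx.extend(group.sample(n=n_test, random_state=seed).index)

    in_test = kept.index.isin(test_idx)
    return kept[~in_test], kept[in_test]


# Same class counts as the suf/ms example in the question.
df = pd.DataFrame({
    "suf": ["blood"] * 8 + ["psych"] * 8 + ["stool"] * 4,
    "ms": (["delirium"] * 2 + ["lucid"] * 6) * 2 + ["delirium"] + ["lucid"] * 3,
})

train, test = split_by_suffix_and_status(df, suffix_col="suf", status_col="ms")
print(train.sort_values(["suf", "ms"]))
print(test.sort_values(["suf", "ms"]))
```

Here stool is dropped because it has only one delirium row, so it cannot appear in both sets. Which specific rows land in test depends on seed. Any remaining imbalance within the kept suffixes would still need to be handled with sample weights, as the question already plans.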