I am trying to use NLP techniques to predict the mental status of patients at the time doctor's notes were taken. I have two classes, suffix (multilabel) as well as mental_status (binary, Lucid or Delirium). I am trying to build train and test sets from a reduced list that meet the following specifications:
- Shared unique suffix list
- In each set, every suffix has at least one Lucid row and at least one Delirium row
I will correct any imbalances with sample weights (I am only trying to predict mental_status, but I need the data to also be balanced on suffix). I just need at least one row for each combination in the full Cartesian product of suffixes and mental_statuses, in both train and test.
For example, in the dataset below, only the DWN Communication suffixes would make it into the train-test split.
suffix mental_status
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
DWN Psychiatry Lucid
DWN Psychiatry Delirium
DWN Psychiatry Delirium
DWN Cardio stuff Lucid
DWN Cardio stuff Delirium
DWN Blood ..... Lucid
DWN Blood ..... Lucid
DWN Blood ..... Lucid
The output I would want for the example above for train would be
DWN Communication Lucid
DWN Communication Delirium
DWN Communication Lucid
and test
DWN Communication Lucid
DWN Communication Delirium
I've tried
df.groupby(['suffix','mental_status']).filter(lambda x: len(x) > 1)
but it doesn't ensure there is one of each Delirium and Lucid for each suffix.
As another example,
suf ms
11 blood delirium
15 blood delirium
0 blood lucid
1 blood lucid
5 blood lucid
6 blood lucid
10 blood lucid
16 blood lucid
3 psych delirium
19 psych delirium
4 psych lucid
8 psych lucid
9 psych lucid
13 psych lucid
14 psych lucid
18 psych lucid
7 stool delirium
2 stool lucid
12 stool lucid
17 stool lucid
would be split into
suf ms
11 blood delirium
0 blood lucid
1 blood lucid
5 blood lucid
6 blood lucid
3 psych delirium
4 psych lucid
8 psych lucid
9 psych lucid
13 psych lucid
and
suf ms
15 blood delirium
10 blood lucid
16 blood lucid
19 psych delirium
14 psych lucid
18 psych lucid
CodePudding user response:
Does it work if you just do df.drop_duplicates(inplace=True)? It looks like that’s all there is to it because you just want the unique values based on the combination of both columns.
CodePudding user response:
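Use a boolean mask to keep only the rows whose suffix contains "DWN":
# str.contains builds a row-wise boolean mask; na=False treats missing suffixes as non-matches
mask = df['suffix'].str.contains('DWN', na=False)
df = df[mask]
df.dropna(inplace=True)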
CodePudding user response:
What about using the index and duplicated(keep=False)?
test = df[~df.duplicated(keep=False)]
output = df.drop_duplicates()
output = output[~output.index.isin(test.index)]
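For comparison, here is a minimal sketch of the split the question describes, using pandas only. It assumes a suffix needs at least two rows per mental status to qualify (one for train, one for test), which matches both examples in the question. The function name split_by_suffix_and_status, the test_frac and seed parameters, and the column names suf and ms (taken from the second example) are illustrative, and the index is assumed to be unique.

```python
import pandas as pd

def split_by_suffix_and_status(df, suffix_col="suf", status_col="ms",
                               test_frac=0.3, seed=0):
    """Hypothetical helper: returns (train, test) sharing the same suffixes,
    with every (suffix, status) pair present in both. Assumes a unique index."""
    keys = [suffix_col, status_col]
    n_statuses = df[status_col].nunique()

    # Number of rows in each row's (suffix, status) pair.
    pair_size = df.groupby(keys)[status_col].transform("size")
    # A pair needs at least 2 rows: one for train and one for test.
    usable = df[pair_size >= 2]

    # Keep only suffixes for which every status still has a usable pair.
    statuses_left = usable.groupby(suffix_col)[status_col].transform("nunique")
    kept = usable[statuses_left == n_statuses]

    # Sample the test rows pair by pair, leaving at least one row for train.
    test_parts = []
    for _, group in kept.groupby(keys):
        n_test = min(max(1, round(len(group) * test_frac)), len(group) - 1)
        test_parts.append(group.sample(n=n_test, random_state=seed))

    test = pd.concat(test_parts) if test_parts else kept.iloc[0:0]
    train = kept.drop(test.index)
    return train, test
```

Calling train, test = split_by_suffix_and_status(df) on the second example should drop stool (it has only one delirium row) and keep blood and psych in both sets. Which rows land in test depends on the seed, so the exact rows will not necessarily match the split shown in the question.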
