I have the following dataframe:
id outcome
0 3 no
1 3 no
2 3 no
3 3 yes
4 3 no
5 5 no
6 5 no
7 5 yes
8 5 no
9 5 yes
10 6 no
11 6 no
12 6 yes
13 6 no
14 6 no
I want to remove the no outcomes at the start of a sequence before a yes, and keep all other no outcomes, so the output dataframe looks like this:
id outcome
3 3 yes
4 3 no
7 5 yes
8 5 no
9 5 yes
12 6 yes
13 6 no
14 6 no
At the moment I have tried this:
df = pd.DataFrame(data={
'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 6, 6, 6, 6, 6],
'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'no']
})
df = df[df.groupby('id').outcome.transform(lambda x: x.ne('no'))]
However, this simply removes all no outcomes.
I know I then need to take the index of these rows and remove them from the dataframe. Any suggestions?
CodePudding user response:
Use groupby with cumsum to mark all 'no' at the start with a 0:
df['no_group'] = df.groupby('id')['outcome'].apply(lambda x: x.eq('yes').cumsum())
Now, the number of 'no's to remove is:
num_no_to_remove = (df['no_group'] == 0).sum()
And the wanted dataframe can be obtained by filtering:
df.loc[df['no_group'] > 0].drop(columns=['no_group'])
Result:
id outcome
3 3 yes
4 3 no
7 5 yes
8 5 no
9 5 yes
12 6 yes
13 6 no
14 6 no
CodePudding user response:
For keeping only last no values of each group and all the yes values, this code will do the trick:
df = df[(df.replace({'no': np.nan, 'yes': 1}).groupby('id')['outcome'].bfill() != 1) | (df['outcome'] == 'yes')]
Output:
>>> df
id outcome
3 3 yes
4 3 no
5 3 no
8 5 yes
9 5 yes
12 6 yes
(In the original df, I added a second no the end of group 3 to make sure it works for multiple no's at the end).
Essentially what the code does is it
- Replaces
yesvalues with an arbitrary value (1) in this case - Replaces
novalues with NaN (which is important!) - Groups the rows by their ID
- For each group, replace all NaN rows coming before the last non-NaN row with the value of the last non-NaN row. Since the
yes's are1and theno's are NaN, this will cause everything except the lastno's of the group to be replaced with the arbitrary number (1) - Creates a mask which selects all those last
novalues of each group - Creates a second mask which selects all
yesvalues - Uses those two masks combined to return , and all
yesvalues, and allnovalues that are at the end of a group.
For the question regarding the count of no's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently.
