EDITED:
I have the following dataframe of students with their exam scores in different dates (sorted):
df = pd.DataFrame({'student': 'A A A B B B B C C'.split(),
'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
datetime.datetime(2013,7,1),datetime.datetime(2013,9,2),
datetime.datetime(2013,10,1),datetime.datetime(2013,11,2),
datetime.datetime(2014,2,2),datetime.datetime(2013,7,1),
datetime.datetime(2013,9,2),],
'score': [15, 17, 32, 22, 28, 24, 33, 33, 15]})
print(df)
student exam_date score
0 A 2013-04-01 15
1 A 2013-06-01 17
2 A 2013-07-01 32
3 B 2013-09-02 22
4 B 2013-10-01 28
5 B 2013-11-02 24
6 B 2014-02-02 33
7 C 2013-07-01 33
8 C 2013-09-02 15
I need to keep only those rows where the score is increased by more than 10 from the local minima.
For example, for the student A, the local minima is 15 and the score is increased to 32 in the next-to-to date, so we're gonna keep that.
For the student B, no score is increased by more than 10 from local minima. 28-22 and 33-24 both are less than 10.
For the student C, the local minima is 15, but the score isn't increased after that, so we're gonna drop that.
I'm trying the following script:
out = df[df['score'] - df.groupby('student', as_index=False)['score'].cummin()['score']>= 10]
print(out)
2 A 2013-07-01 32
6 B 2014-02-02 33 #--Shouldn't capture this as it's increased by `9` from local minima of `24`
Desired output:
student exam_date score
2 A 2013-07-01 32
# For A, score of 32 is increased by 17 from local minima of 15
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
CodePudding user response:
Incorporating @Corralien's solution you posted, I've come up with a one-liner that works nicely:
filtered = df.groupby('student', as_index=False).apply(lambda x: None if (v := (x['score'].cummax() * (x['score'] > x['score'].shift()) - (x['score'].cummin()) >= 10)).sum() == 0 else x.loc[v.idxmax()] ).dropna()
Output:
>>> filtered
student exam_date score
0 A 2013-06-01 27.0
1 B 2013-10-01 43.0
CodePudding user response:
We could try the following:
Find the difference between consecutive scores for each student using
groupbydiff.using
where, assign NaN values to all rows where the score difference is less than 10use
groupbyfirstto get the first score differences greater than 10 for each student.
msk = (diff>10) | (diff.groupby([diff[::-1].shift().lt(0).cumsum()[::-1], df['student']]).cumsum()>10)
out = df.where(msk).groupby('student').first().reset_index()
Output:
student exam_date score
0 A 2013-06-01 27.0
1 B 2013-10-01 43.0
CodePudding user response:
Assuming your dataframe is already sorted by date:
highest_score = lambda x: x['score'] - x['score'].mask(x['score'].gt(x['score'].shift())).ffill() > 10
out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)
# Output
student exam_date score
2 A 2013-07-01 32
Focus on lambda function
Let's modify your dataframe and extract one student to avoid groupby:
>>> df = df[df['student'] == 'B']
student exam_date score
3 B 2013-09-02 22
4 B 2013-10-01 28
5 B 2013-11-02 24
6 B 2014-02-02 33
# Step-1: find row where value is not a local minima
>>> df['score'].gt(df['score'].shift())
3 False
4 True
5 False
6 True
Name: score, dtype: bool
# Step-2: hide non local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift()))
3 22.0
4 NaN
5 24.0
6 NaN
Name: score, dtype: float64
# Step-3: fill forward local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift()))
3 22.0
4 22.0
5 24.0
6 24.0
Name: score, dtype: float64
# Step-4: check if the condition is True
>>> df['score'] - df['score'].mask(df['score'].gt(df['score'].shift())) > 10
3 False
4 False
5 False
6 False
Name: score, dtype: bool
