I have the following dataframe of students with their exam scores in different dates (sorted):
df = pd.DataFrame({'student': 'A A A B B B B C C D D'.split(),
'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
datetime.datetime(2013,7,1),datetime.datetime(2013,9,2),
datetime.datetime(2013,10,1),datetime.datetime(2013,11,2),
datetime.datetime(2014,2,2),datetime.datetime(2014,5,2),
datetime.datetime(2014,6,2), datetime.datetime(2013,7,1),
datetime.datetime(2013,9,2),],
'score': [15, 27, 32, 22, 43, 20, 44, 40, 33, 15, 20]})
print(df)
student exam_date score
0 A 2013-04-01 15
1 A 2013-06-01 27
2 A 2013-07-01 32
3 B 2013-09-02 22
4 B 2013-10-01 43
5 B 2013-11-02 20
6 B 2014-02-02 44
7 C 2014-05-02 40
8 C 2014-06-02 33
9 D 2013-07-01 15
10 D 2013-09-02 20
I need to keep only those rows where the score is increased by more than 10 from the local minima. The local maxima has to be in the latter date than the local minima date.
For example, for the student A, the local minima is 15 and the score is increased to 27 in the next date, so we're gonna keep that.
For the student B, the score is increased by more than 10 twice, but we're keeping the 1st increase (43).
For the student C, the local minima is 33, but the score isn't increased after that, so we're gonna drop that.
For the student D, the local minima is 15, and the score is increased by only 5, so we're gonna drop that too.
I'm trying the following script:
out = df[df['score'] - df.groupby('student', as_index=False)['score'].cummin()['score']>= 10]
print(out)
student exam_date score
1 A 2013-06-01 27
2 A 2013-07-01 32
4 B 2013-10-01 43
6 B 2014-02-02 44
It works here but if I change the B with this:
3 B 2013-09-02 22
4 B 2013-10-01 43
5 B 2013-11-02 40
6 B 2014-02-02 41
.. It returns anything after 22 with the increase of more than 10:
student exam_date score
1 A 2013-06-01 27
2 A 2013-07-01 32
4 B 2013-10-01 43
5 B 2013-11-02 40 #- Shouldn't capture this
6 B 2014-02-02 41 #- Shouldn't capture this
Desired output:
student exam_date score
1 A 2013-06-01 27
4 B 2013-10-01 43
# For A, local maxima of 27 is increased by 12 from local minima of 15
# For B, local maxima of 43 is increased by 21 from local minima of 22
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
CodePudding user response:
Incorporating @Corralien's solution you posted, I've come up with a one-liner that works nicely:
filtered = df.groupby('student', as_index=False).apply(lambda x: None if (v := (x['score'].cummax() * (x['score'] > x['score'].shift()) - (x['score'].cummin()) >= 10)).sum() == 0 else x.loc[v.idxmax()] ).dropna()
Output:
>>> filtered
student exam_date score
0 A 2013-06-01 27.0
1 B 2013-10-01 43.0
CodePudding user response:
We could try the following:
Find the difference between consecutive scores for each student using
groupbydiff.using
where, assign NaN values to all rows where the score difference is less than 10use
groupbyfirstto get the first score differences greater than 10 for each student.
diff = df.groupby('student')['score'].diff()
out = df.where(diff>=10).groupby('student').first().reset_index()
Output:
student exam_date score
0 A 2013-06-01 27.0
1 B 2013-10-01 43.0
