Filter Dataframe Based on Local Minima with Increasing Timeline-CodePudding

EDITED:

I have the following dataframe of students with their exam scores in different dates (sorted):

df = pd.DataFrame({'student': 'A A A B B B B C C'.split(),
                  'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
                               datetime.datetime(2013,7,1),datetime.datetime(2013,9,2),
                               datetime.datetime(2013,10,1),datetime.datetime(2013,11,2),
                               datetime.datetime(2014,2,2),datetime.datetime(2013,7,1),
                               datetime.datetime(2013,9,2),],
                   'score': [15, 17, 32, 22, 28, 24, 33, 33, 15]})

print(df)

  student  exam_date  score
0       A 2013-04-01     15
1       A 2013-06-01     17
2       A 2013-07-01     32
3       B 2013-09-02     22
4       B 2013-10-01     28
5       B 2013-11-02     24
6       B 2014-02-02     33
7       C 2013-07-01     33
8       C 2013-09-02     15

I need to keep only those rows where the score is increased by more than 10 from the local minima.

For example, for the student A, the local minima is 15 and the score is increased to 32 in the next-to-to date, so we're gonna keep that.

For the student B, no score is increased by more than 10 from local minima. 28-22 and 33-24 both are less than 10.

For the student C, the local minima is 15, but the score isn't increased after that, so we're gonna drop that.

I'm trying the following script:

out = df[df['score'] - df.groupby('student', as_index=False)['score'].cummin()['score']>= 10]

print(out)
2   A   2013-07-01  32
6   B   2014-02-02  33 #--Shouldn't capture this as it's increased by `9` from local minima of `24`

Desired output:

   student  exam_date  score
2        A  2013-07-01  32

# For A, score of 32 is increased by 17 from local minima of 15

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response：

Incorporating @Corralien's solution you posted, I've come up with a one-liner that works nicely:

filtered = df.groupby('student', as_index=False).apply(lambda x: None if (v := (x['score'].cummax() * (x['score'] > x['score'].shift()) - (x['score'].cummin()) >= 10)).sum() == 0 else x.loc[v.idxmax()] ).dropna()

Output:

>>> filtered
  student  exam_date  score
0       A 2013-06-01   27.0
1       B 2013-10-01   43.0

CodePudding user response：

We could try the following:

Find the difference between consecutive scores for each student using groupby diff.
using where, assign NaN values to all rows where the score difference is less than 10
use groupby first to get the first score differences greater than 10 for each student.

msk = (diff>10) | (diff.groupby([diff[::-1].shift().lt(0).cumsum()[::-1], df['student']]).cumsum()>10)
out = df.where(msk).groupby('student').first().reset_index()

Output:

  student  exam_date  score
0       A 2013-06-01   27.0
1       B 2013-10-01   43.0

CodePudding user response：

Assuming your dataframe is already sorted by date:

highest_score = lambda x: x['score'] - x['score'].mask(x['score'].gt(x['score'].shift())).ffill() > 10
out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)

# Output
  student  exam_date  score
2       A 2013-07-01     32

Focus on lambda function

Let's modify your dataframe and extract one student to avoid groupby:

>>> df = df[df['student'] == 'B']
  student  exam_date  score
3       B 2013-09-02     22
4       B 2013-10-01     28
5       B 2013-11-02     24
6       B 2014-02-02     33

# Step-1: find row where value is not a local minima
>>> df['score'].gt(df['score'].shift())
3    False
4     True
5    False
6     True
Name: score, dtype: bool

# Step-2: hide non local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift()))
3    22.0
4     NaN
5    24.0
6     NaN
Name: score, dtype: float64

# Step-3: fill forward local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift()))
3    22.0
4    22.0
5    24.0
6    24.0
Name: score, dtype: float64

# Step-4: check if the condition is True
>>> df['score'] - df['score'].mask(df['score'].gt(df['score'].shift())) > 10
3    False
4    False
5    False
6    False
Name: score, dtype: bool