Home > Software engineering >  Filter Dataframe Based on Local Minima and Local Maxima with Increasing Timeline
Filter Dataframe Based on Local Minima and Local Maxima with Increasing Timeline

Time:02-02

I have the following dataframe of students with their exam scores in different dates (sorted):

df = pd.DataFrame({'student': 'A A A B B B B C C D D'.split(),
                  'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
                               datetime.datetime(2013,7,1),datetime.datetime(2013,9,2),
                               datetime.datetime(2013,10,1),datetime.datetime(2013,11,2),
                               datetime.datetime(2014,2,2),datetime.datetime(2014,5,2),
                               datetime.datetime(2014,6,2), datetime.datetime(2013,7,1),
                               datetime.datetime(2013,9,2),],
                   'score': [15, 27, 32, 22, 43, 20, 44, 40, 33, 15, 20]})

print(df)

   student  exam_date  score
0        A 2013-04-01     15
1        A 2013-06-01     27
2        A 2013-07-01     32
3        B 2013-09-02     22
4        B 2013-10-01     43
5        B 2013-11-02     20
6        B 2014-02-02     44
7        C 2014-05-02     40
8        C 2014-06-02     33
9        D 2013-07-01     15
10       D 2013-09-02     20

I need to keep only those rows where the score is increased by more than 10 from the local minima. The local maxima has to be in the latter date than the local minima date.

For example, for the student A, the local minima is 15 and the score is increased to 27 in the next date, so we're gonna keep that.

For the student B, the score is increased by more than 10 twice, but we're keeping the 1st increase (43).

For the student C, the local minima is 33, but the score isn't increased after that, so we're gonna drop that.

For the student D, the local minima is 15, and the score is increased by only 5, so we're gonna drop that too.

I'm trying the following script:

out = df[df['score'] - df.groupby('student', as_index=False)['score'].cummin()['score']>= 10]

print(out)
   student  exam_date  score
1        A 2013-06-01     27
2        A 2013-07-01     32
4        B 2013-10-01     43
6        B 2014-02-02     44

It works here but if I change the B with this:

3        B 2013-09-02     22
4        B 2013-10-01     43
5        B 2013-11-02     40
6        B 2014-02-02     41

.. It returns anything after 22 with the increase of more than 10:

    student     exam_date   score
1   A   2013-06-01  27
2   A   2013-07-01  32
4   B   2013-10-01  43
5   B   2013-11-02  40 #- Shouldn't capture this
6   B   2014-02-02  41 #- Shouldn't capture this

Desired output:

   student  exam_date  score
1        A 2013-06-01     27
4        B 2013-10-01     43

# For A, local maxima of 27 is increased by 12 from local minima of 15  
# For B, local maxima of 43 is increased by 21 from local minima of 22 

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response:

Incorporating @Corralien's solution you posted, I've come up with a one-liner that works nicely:

filtered = df.groupby('student', as_index=False).apply(lambda x: None if (v := (x['score'].cummax() * (x['score'] > x['score'].shift()) - (x['score'].cummin()) >= 10)).sum() == 0 else x.loc[v.idxmax()] ).dropna()

Output:

>>> filtered
  student  exam_date  score
0       A 2013-06-01   27.0
1       B 2013-10-01   43.0

CodePudding user response:

We could try the following:

  1. Find the difference between consecutive scores for each student using groupby diff.

  2. using where, assign NaN values to all rows where the score difference is less than 10

  3. use groupby first to get the first score differences greater than 10 for each student.

diff = df.groupby('student')['score'].diff()
out = df.where(diff>=10).groupby('student').first().reset_index()

Output:

  student  exam_date  score
0       A 2013-06-01   27.0
1       B 2013-10-01   43.0
  •  Tags:  
  • Related