Apply condition on a column after groupby in pandas and then aggregate to get 2 max value-CodePudding

data    field   bcorr
0   A   cs1 0.8
1   A   cs2 0.9
2   A   cs3 0.7
3   A   pq1 0.4
4   A   pq2 0.6
5   A   pq3 0.5
6   B   cs1 0.8
7   B   cs2 0.9
8   B   cs3 0.7
9   B   pq1 0.4
10  B   pq2 0.6
11  B   pq3 0.5

For every data A and B in data column, segregate the cs & pq fields from field column, and then aggregate to get 2 max value of bcorr.

Sample result would be like:

data    field   bcorr
0   A   cs1 0.8
1   A   cs2 0.9
4   A   pq2 0.6
5   A   pq3 0.5
6   B   cs1 0.8
7   B   cs2 0.9
10  B   pq2 0.6
11  B   pq3 0.5

For this, one of option is to do this while creating the list of records, which obviously will have high complexity.

second, i want to do this with pandas dataframe, where i used groupby on data column, then applying startswith to get the source field and then apply max

CodePudding user response：

First, extract common part of each field (first letters) then sort values (highest values go bottom). Finally group by data column and field series then keep the two last values (the highest):

field = df['field'].str.extract('([^\d] )', expand=False)
out = df.sort_values('bcorr').groupby(['data', field]).tail(2).sort_index()
print(out)

# Output
   data field  bcorr
0     A   cs1    0.8
1     A   cs2    0.9
4     A   pq2    0.6
5     A   pq3    0.5
6     B   cs1    0.8
7     B   cs2    0.9
10    B   pq2    0.6
11    B   pq3    0.5

If you field have only two fixed letters to determine the field, you can use df['field'].str[:2] instead of df['field'].str.extract(...).

CodePudding user response：

You can groupby on date and substring field column using str[:2] which grabs the characters up to the 2nd, and use head(2).

head returns the first n rows, so you will need to sort the data before.

df.sort_values(by=['data','bcorr'],ascending=False).groupby(['data',df.field.str[:2]]).head(2).sort_index()

  data field  bcorr
0     A   cs1    0.8
1     A   cs2    0.9
4     A   pq2    0.6
5     A   pq3    0.5
6     B   cs1    0.8
7     B   cs2    0.9
10    B   pq2    0.6
11    B   pq3    0.5

Following the above logic, using tail(2) and sorting the data the other way around get's you the same output:

df.sort_values(by=['data','bcorr']).groupby(['data',df.field.str[:2]]).tail(2).sort_index()

EDIT If you want to generalize to allow for arbitrary number of non digit characters in your field column, you could use str.replace to replace all numeric characters with empty in your groupby:

df.sort_values(by=['data','bcorr']).groupby(['data',df.field.str.replace(r"[0-9]",'')]).tail(2).sort_index()

CodePudding user response：

Here's one way:

(i) Create a new column field_name from field by selecting first elements and use groupby on data and field_name and use nlargest to find two largest values for each group to create Series temp

(ii) Using index of temp created in (i), filter df

(iii) Assign temp to the bcorr column of the filtered Dataframe from (ii)

temp = df.assign(field_name=df['field'].str[:2]).groupby(['data','field_name'])['bcorr'].nlargest(2).droplevel([0,1]).sort_index()
out = df.loc[temp.index]
out['bcorr'] = temp

Output:

   data field  bcorr
0     A   cs1    0.8
1     A   cs2    0.9
4     A   pq2    0.6
5     A   pq3    0.5
6     B   cs1    0.8
7     B   cs2    0.9
10    B   pq2    0.6
11    B   pq3    0.5

CodePudding user response：

I believe this is something you are trying to achieve

import pandas as pd

df = {'data': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
      'fie': ['cs1', 'cs2', 'cs3', 'pq1', 'pq2', 'pq3', 'cs1', 'cs2', 'cs3', 'pq1', 'pq2', 'pq3'],
      'bcorr': [0.8, 0.9, 0.7, 0.4, 0.6, 0.5, 0.8, 0.9, 0.7, 0.4, 0.6, 0.5]}

df = {'data': df['data'], 'fie_c': [x[:2] for x in df['fie']], 'fie_n': [x[2] for x in df['fie']],
      'bcorr': df['bcorr']}
df = pd.DataFrame(data=df)
df = df.sort_values('bcorr', ascending=False).groupby(by=['data', 'fie_c']).head(2).sort_values('data')
df['fie'] = df[['fie_c', 'fie_n']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
df = df.drop(columns=['fie_c', 'fie_n'])
df = df[['data', 'fie', 'bcorr']]
print(df)

Output

   data  fie  bcorr
1     A  cs2    0.9
0     A  cs1    0.8
4     A  pq2    0.6
5     A  pq3    0.5
7     B  cs2    0.9
6     B  cs1    0.8
10    B  pq2    0.6
11    B  pq3    0.5

Note that the first few lines can be more cleaner but my focus was on the line

df = df.sort_values('bcorr', ascending=False).groupby(by=['data', 'fie_c']).head(2).sort_values('data')

which does most of the important work.