data field bcorr
0 A cs1 0.8
1 A cs2 0.9
2 A cs3 0.7
3 A pq1 0.4
4 A pq2 0.6
5 A pq3 0.5
6 B cs1 0.8
7 B cs2 0.9
8 B cs3 0.7
9 B pq1 0.4
10 B pq2 0.6
11 B pq3 0.5
For every data A and B in data column, segregate the cs & pq fields from field column, and then aggregate to get 2 max value of bcorr.
Sample result would be like:
data field bcorr
0 A cs1 0.8
1 A cs2 0.9
4 A pq2 0.6
5 A pq3 0.5
6 B cs1 0.8
7 B cs2 0.9
10 B pq2 0.6
11 B pq3 0.5
For this, one of option is to do this while creating the list of records, which obviously will have high complexity.
second, i want to do this with pandas dataframe, where i used groupby on data column, then applying startswith to get the source field and then apply max
CodePudding user response:
First, extract common part of each field (first letters) then sort values (highest values go bottom). Finally group by data column and field series then keep the two last values (the highest):
field = df['field'].str.extract('([^\d] )', expand=False)
out = df.sort_values('bcorr').groupby(['data', field]).tail(2).sort_index()
print(out)
# Output
data field bcorr
0 A cs1 0.8
1 A cs2 0.9
4 A pq2 0.6
5 A pq3 0.5
6 B cs1 0.8
7 B cs2 0.9
10 B pq2 0.6
11 B pq3 0.5
If you field have only two fixed letters to determine the field, you can use df['field'].str[:2] instead of df['field'].str.extract(...).
CodePudding user response:
You can groupby on date and substring field column using str[:2] which grabs the characters up to the 2nd, and use head(2).
head returns the first n rows, so you will need to sort the data before.
df.sort_values(by=['data','bcorr'],ascending=False).groupby(['data',df.field.str[:2]]).head(2).sort_index()
data field bcorr
0 A cs1 0.8
1 A cs2 0.9
4 A pq2 0.6
5 A pq3 0.5
6 B cs1 0.8
7 B cs2 0.9
10 B pq2 0.6
11 B pq3 0.5
Following the above logic, using tail(2) and sorting the data the other way around get's you the same output:
df.sort_values(by=['data','bcorr']).groupby(['data',df.field.str[:2]]).tail(2).sort_index()
EDIT
If you want to generalize to allow for arbitrary number of non digit characters in your field column, you could use str.replace to replace all numeric characters with empty in your groupby:
df.sort_values(by=['data','bcorr']).groupby(['data',df.field.str.replace(r"[0-9]",'')]).tail(2).sort_index()
CodePudding user response:
Here's one way:
(i) Create a new column field_name from field by selecting first elements and use groupby on data and field_name and use nlargest to find two largest values for each group to create Series temp
(ii) Using index of temp created in (i), filter df
(iii) Assign temp to the bcorr column of the filtered Dataframe from (ii)
temp = df.assign(field_name=df['field'].str[:2]).groupby(['data','field_name'])['bcorr'].nlargest(2).droplevel([0,1]).sort_index()
out = df.loc[temp.index]
out['bcorr'] = temp
Output:
data field bcorr
0 A cs1 0.8
1 A cs2 0.9
4 A pq2 0.6
5 A pq3 0.5
6 B cs1 0.8
7 B cs2 0.9
10 B pq2 0.6
11 B pq3 0.5
CodePudding user response:
I believe this is something you are trying to achieve
import pandas as pd
df = {'data': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
'fie': ['cs1', 'cs2', 'cs3', 'pq1', 'pq2', 'pq3', 'cs1', 'cs2', 'cs3', 'pq1', 'pq2', 'pq3'],
'bcorr': [0.8, 0.9, 0.7, 0.4, 0.6, 0.5, 0.8, 0.9, 0.7, 0.4, 0.6, 0.5]}
df = {'data': df['data'], 'fie_c': [x[:2] for x in df['fie']], 'fie_n': [x[2] for x in df['fie']],
'bcorr': df['bcorr']}
df = pd.DataFrame(data=df)
df = df.sort_values('bcorr', ascending=False).groupby(by=['data', 'fie_c']).head(2).sort_values('data')
df['fie'] = df[['fie_c', 'fie_n']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
df = df.drop(columns=['fie_c', 'fie_n'])
df = df[['data', 'fie', 'bcorr']]
print(df)
Output
data fie bcorr
1 A cs2 0.9
0 A cs1 0.8
4 A pq2 0.6
5 A pq3 0.5
7 B cs2 0.9
6 B cs1 0.8
10 B pq2 0.6
11 B pq3 0.5
Note that the first few lines can be more cleaner but my focus was on the line
df = df.sort_values('bcorr', ascending=False).groupby(by=['data', 'fie_c']).head(2).sort_values('data')
which does most of the important work.
