I have a dataframe and would like to aggregate based on if some values in column Result exist or not. So if for any ìndex1 and ìndex2 there is an A then my total_result column should be A. If there is no A but there is a B then my total_result column should be B, etc.
Below I have some working code but I was wondering if there is no easier way to achieve the same?
df = pd.DataFrame({'index1':[1,2,2,2,3,3,4], 'index2':[1,2,2,2,3,3,4], 'Result':['A','','B','As','C','B','D']})
df_out = (df.groupby(['index1', 'index2'])
.agg(A=('Result', lambda x: (x == 'A').any() | (x == 'As').any()),
B=('Result', lambda x: (x == 'B').any()),
C=('Result', lambda x: (x == 'C').any()),
D=('Result', lambda x: (x == 'D').any()))
.reset_index()
.assign(tot_result = lambda x: np.where(x['A'], 'A',
np.where(x['B'], 'B',
np.where(x['C'], 'C', 'D'))))
.drop(columns=['A', 'B', 'C', 'D'])
)
CodePudding user response:
Solution if always match at least one value - First create helper columns and then aggregate DataFrameGroupBy.any, for first match column name use DataFrame.idxmax:
df1 = (df.assign(A = (df.Result == 'A') | (df.Result == 'As'),
B = (df.Result == 'B'),
C = (df.Result == 'C'),
D = (df.Result == 'D'))
.groupby(['index1', 'index2'])[['A','B','C','D']]
.any()
.idxmax(axis=1)
.reset_index(name='tot_result'))
print (df1)
If possible no match:
df = pd.DataFrame({'index1':[1,2,2,2,3,3,4],
'index2':[1,2,2,2,3,3,4],
'Result':['A','','B','As','C','B','E']})
df1 = (df.assign(A = (df.Result == 'A') | (df.Result == 'As'),
B = (df.Result == 'B'),
C = (df.Result == 'C'),
D = (df.Result == 'D'))
.groupby(['index1', 'index2'])[['A','B','C','D']]
.any()
.replace({True:1, False:None})
.apply(lambda x: x.first_valid_index(), axis=1)
.reset_index(name='tot_result'))
print (df1)
index1 index2 tot_result
0 1 1 A
1 2 2 A
2 3 3 B
3 4 4 None
