I am new to pandas dataframes, so I apologies in case there's an easy or even built-in way to do so.
Let's say I have a dataframe df with 3 columns A (a string), B (a float) and C (a bool). Values of column A are not unique. B is a random number and rows with same A value can have different values of B. Columns C is True if the value of A is repeated in the dataset.
An example
| | A | B | C |
|---|-----|-----|-------|
| 0 | cat | 10 | True |
| 1 | dog | 10 | False |
| 2 | cat | 20 | True |
| 3 | bee | 100 | False |
(The column C is actually redundant and could be obtained with df['C']=df['A'].duplicated(keep=False))
What I want to obtain is a dataframe were, for duplicated entries of A (C==True), only the row with the highest B value is kept.
I know how to get the list of rows with maximum value of B:
df.loc[df[df['C']].groupby('A')['B'].idxmax()] #is this the best way actually?
but what I want is the opposite: filter df so to get only the entries not duplicated (C==False) and the duplicated ones with the highest B.
One possibility could be to concatenate df[~df['C']] and the previous table but is it the best way actually?
CodePudding user response:
One approach:
res = df.iloc[df.groupby("A")["B"].idxmax()]
print(res)
Output
A B C
3 bee 100 False
2 cat 20 True
1 dog 10 False
