I have a dataframe of ids, states, places, types and populations:
| id | state | place | type | population |
|---|---|---|---|---|
| A | ontario | toronto | Place | 5429524.0 |
| A | ontario | ottawa | Capital | 989567.0 |
| B | saskatchewan | saskatoon | Place | 246376.0 |
| B | saskatchewan | regina | Place | 215106.0 |
| B | saskatchewan | moose jaw | Place | 33890.0 |
I groupby to get all states with the same ID:
grouped_df = locations_df.groupby(["id", "state"])
I then need to either get the row that has ['type']== 'Capital' OR if that doesn't exist in that group, then choose the row with the highest population. e.g.
| id | state | place | type | population |
|---|---|---|---|---|
| A | ontario | ottawa | Capital | 989567.0 |
| B | saskatchewan | saskatoon | Place | 246376.0 |
I can get these individually using:
locations_df.groupby(["id", "state"]).population.idxmax()
locations_df.groupby(["id", "state"]).filter(lambda x: (x['type']== 'Capital').any())
Is there a way to combine this into one call?
CodePudding user response:
You could use the fact that there are only two types and sort_values (ascending for "type" to have "Capital" first and descending for "population" to have the highest first):
(df.sort_values(by=['type', 'population'], ascending=[True, False])
.groupby('id').first()
)
output:
state place type population
id
A ontario ottawa Capital 989567.0
B saskatchewan saskatoon Place 246376.0
CodePudding user response:
Sort values by type and population:
typeby ascending order because 'C' is before 'P'populationby descending order to have highest values at top
Then group by id and state columns and finally take the first value of each group:
out = df.sort_values(['type', 'population'], ascending=[True, False]) \
.groupby(['id', 'state']).first()
print(out)
# Output
id state place type population
1 A ontario ottawa Capital 989567.0
2 B saskatchewan saskatoon Place 246376.0
CodePudding user response:
Here's an alternate solution, just for fun:
filtered = df.loc[(df.groupby(['id', 'state'])['population'].transform('idxmax').eq(df.index) df['type'].eq('Capital').mul(2)).groupby([df['id'], df['state']]).idxmax()]
Output:
>>> filtered
id state place type population
1 A ontario ottawa Capital 989567.0
2 B saskatchewan saskatoon Place 246376.0
