Pandas GroupBy and then choose one result using if or max-CodePudding

I have a dataframe of ids, states, places, types and populations:

id	state	place	type	population
A	ontario	toronto	Place	5429524.0
A	ontario	ottawa	Capital	989567.0
B	saskatchewan	saskatoon	Place	246376.0
B	saskatchewan	regina	Place	215106.0
B	saskatchewan	moose jaw	Place	33890.0

I groupby to get all states with the same ID:

grouped_df = locations_df.groupby(["id", "state"])

I then need to either get the row that has ['type']== 'Capital' OR if that doesn't exist in that group, then choose the row with the highest population. e.g.

id	state	place	type	population
A	ontario	ottawa	Capital	989567.0
B	saskatchewan	saskatoon	Place	246376.0

I can get these individually using:

locations_df.groupby(["id", "state"]).population.idxmax()

locations_df.groupby(["id", "state"]).filter(lambda x: (x['type']== 'Capital').any())

Is there a way to combine this into one call?

CodePudding user response：

You could use the fact that there are only two types and sort_values (ascending for "type" to have "Capital" first and descending for "population" to have the highest first):

(df.sort_values(by=['type', 'population'], ascending=[True, False])
   .groupby('id').first()
)

output:

            state       place      type  population
id                                                 
A        ontario      ottawa   Capital     989567.0
B   saskatchewan   saskatoon     Place     246376.0

CodePudding user response：

Sort values by type and population:

type by ascending order because 'C' is before 'P'
population by descending order to have highest values at top

Then group by id and state columns and finally take the first value of each group:

out = df.sort_values(['type', 'population'], ascending=[True, False]) \
        .groupby(['id', 'state']).first()
print(out)

# Output
  id         state      place     type  population
1  A       ontario     ottawa  Capital    989567.0
2  B  saskatchewan  saskatoon    Place    246376.0

CodePudding user response：

Here's an alternate solution, just for fun:

filtered = df.loc[(df.groupby(['id', 'state'])['population'].transform('idxmax').eq(df.index)   df['type'].eq('Capital').mul(2)).groupby([df['id'], df['state']]).idxmax()]

Output:

>>> filtered
  id         state      place     type  population
1  A       ontario     ottawa  Capital    989567.0
2  B  saskatchewan  saskatoon    Place    246376.0