I have some initial data that looks like this:
code type value
1111 Golf Acceptable
1111 Golf Undesirable
1111 Basketball Acceptable
1111 Basketball Undesirable
1111 Basketball Undesirable
and I'm trying to group it on the code and type columns to get the row with the most occurrences. In the case of a tie, I want to select the row with the value Undesirable. So the example above would become this:
code type value
1111 Golf Undesirable
1111 Basketball Undesirable
Currently I'm doing it this way:
df = pd.DataFrame(df.groupby(['code', 'type', 'value']).size().reset_index(name='count'))
df = df.sort_values(['type', 'count'])
df = pd.DataFrame(df.groupby(['code', 'type']).last().reset_index())
I've done some testing of this and it seems to do what I want, but I don't really like trusting the .last() call, and hoping in the case of a tie that Undesirable was sorted last. Is there a better way to group this to ensure I always get the higher count, or in the cases of a tie select the Undesirable value?
Performance isn't too much of an issue as I'm only working with around 50k rows or so.
CodePudding user response:
Case 1
If the value column only contains two values i.e. ['Acceptable', 'Undesirable'] then we can rely on the fact that Acceptable < Undesirable alphabetically. In this case you can use the following simplified solution.
Create an auxiliary column called count which contain the count of number of rows per code, type and value. Then sort the dataframe by count and value and drop the dupes per code and type keeping the last row.
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
Case 2
If the value column contains other values and you can't rely on alphabetical ordering use the following solution which is similar to solution proposed in case 1 but this first converts the value column to ordered Categorical type before sorting
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df['value'] = pd.Categorical(df['value'], categories=['Acceptable', 'Undesirable'], ordered=True)
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
Result
code type value count
1 1111 Golf Undesirable 1
4 1111 Basketball Undesirable 2
CodePudding user response:
Another possible solution, which is based on the following ideas:
Grouping the data by
codeandtype.If a group has more than one row (
len(x) > 1) and its rows have the same count (x['count'] == x['count'].min()).all()), return the row withUndesirable.Otherwise, return the row where the count is maximum (
x.iloc[[x['count'].argmax()]]).
(df.groupby(['code', 'type', 'value'])['value'].size()
.reset_index(name='count').groupby(['code', 'type'])
.apply(lambda x: x.loc[x['value'] == 'Undesirable'] if
((len(x) > 1) and (x['count'] == x['count'].min()).all()) else
x.iloc[[x['count'].argmax()]])
.reset_index(drop=True)
.drop('count', axis=1))
Output:
code type value
0 1111 Basketball Undesirable
1 1111 Golf Undesirable
