I have a data as follows:
Category Brand Sales
Furniture F1 26
Supplies S2 12
Tech T3 17
Furniture F5 21
Supplies S7 14
Tech T2 23
Furniture F7 24
Supplies S4 17
Tech T4 28
I want an output using pandas which gives me top 2 Brands for each Category based on Sales as follows:
Category Brand Sales
Furniture F1 26
F7 24
Supplies S4 17
S7 14
Tech T4 28
T2 23
Please if anyone can help me with this as I'm stuck. Thanks in advance.
CodePudding user response:
Sort by Sales in descending order and then use groupby head(2) to get the top 2 values:
(df.sort_values('Sales', ascending=False)
.groupby('Category')
.head(2)
.sort_values('Category'))
Category Brand Sales
0 Furniture F1 26
6 Furniture F7 24
7 Supplies S4 17
4 Supplies S7 14
8 Tech T4 28
5 Tech T2 23
Another option is to use nlargest in each group:
df.groupby('Category', group_keys=False).apply(lambda g: g.nlargest(2, 'Sales'))
Category Brand Sales
0 Furniture F1 26
6 Furniture F7 24
7 Supplies S4 17
4 Supplies S7 14
8 Tech T4 28
5 Tech T2 23
CodePudding user response:
You can get the index of the largest elements using something like
idx = df.groupby('Category')['Sales'].nlargest(2).index
Now you can index the original dataframe with the second level of the index, which contains the original row indices:
df.loc[idx.get_level_values(1)]
The nice thing about this approach is that it is O(n) rather than O(n log n), since it uses partitioning instead of sorting, which will matter for large datasets.
Of course you can write it all in one line:
df.loc[df.groupby('Category')['Sales'].nlargest(2).index.get_level_values(1)]
