Finding top n sub-groups under a main group based on values of a numeric column in Python-CodePudding

I have a data as follows:

Category  Brand Sales
Furniture  F1    26
Supplies   S2    12
Tech       T3    17
Furniture  F5    21
Supplies   S7    14
Tech       T2    23
Furniture  F7    24
Supplies   S4    17
Tech       T4    28

I want an output using pandas which gives me top 2 Brands for each Category based on Sales as follows:

Category  Brand Sales
Furniture F1     26
          F7     24
Supplies  S4     17
          S7     14
Tech      T4     28
          T2     23

Please if anyone can help me with this as I'm stuck. Thanks in advance.

CodePudding user response：

Sort by Sales in descending order and then use groupby head(2) to get the top 2 values:

(df.sort_values('Sales', ascending=False)
   .groupby('Category')
   .head(2)
   .sort_values('Category'))

    Category Brand  Sales
0  Furniture    F1     26
6  Furniture    F7     24
7   Supplies    S4     17
4   Supplies    S7     14
8       Tech    T4     28
5       Tech    T2     23

Another option is to use nlargest in each group:

df.groupby('Category', group_keys=False).apply(lambda g: g.nlargest(2, 'Sales'))

    Category Brand  Sales
0  Furniture    F1     26
6  Furniture    F7     24
7   Supplies    S4     17
4   Supplies    S7     14
8       Tech    T4     28
5       Tech    T2     23

CodePudding user response：

You can get the index of the largest elements using something like

idx = df.groupby('Category')['Sales'].nlargest(2).index

Now you can index the original dataframe with the second level of the index, which contains the original row indices:

df.loc[idx.get_level_values(1)]

The nice thing about this approach is that it is O(n) rather than O(n log n), since it uses partitioning instead of sorting, which will matter for large datasets.

Of course you can write it all in one line:

df.loc[df.groupby('Category')['Sales'].nlargest(2).index.get_level_values(1)]