I have a df and I want, for each line (which correspond to a month), to have a list of 50% of the highest values of "B" in that line.
| month | A | B |
|---|---|---|
| 1994-07 | A | 50 |
| 1994-07 | B | 60 |
| 1994-07 | C | 70 |
| 1994-07 | D | 80 |
| 1994-07 | E | NAN |
| 1994-07 | F | NAN |
| 1994-08 | A | 90 |
| 1994-08 | B | 60 |
| 1994-08 | C | 70 |
| 1994-08 | D | 95 |
| 1994-08 | E | 100 |
| 1994-08 | F | 110 |
| 1994-08 | G | NAN |
For July/1994 I only have 4 columns "B" filled with values, so 50% will be the 2 highest MV. For the month after, I have 6 stocks, which gives me 3 highest values:
| month | A | B |
|---|---|---|
| 1994-07 | C | 70 |
| 1994-07 | D | 80 |
| 1994-08 | D | 95 |
| 1994-08 | E | 100 |
| 1994-08 | F | 110 |
I have tried:
df = df.groupby(pd.Grouper(freq="M")).apply(lambda g: g.nsmallest(len(g)//2, 'B'))
However, it does not ignore "NAN" and count it as a number. For example, for July/1994 it counts 6 values, and so it returns me the 3 (50% of 6) the highest values from the month. Instead, it should count that there are 4 values and return me the 2 highest.
CodePudding user response:
IIUC, you want to use g['B'].count() as count ignores NaNs:
(df
.groupby('month')
.apply(lambda g: g.nlargest(g['B'].count()//2, columns='B'))
.droplevel(0).sort_index()
)
Alternatively, dropna first:
(df
.dropna(subset='B')
.groupby('month')
.apply(lambda g: g.nlargest(len(g)//2, 'B'))
.droplevel(0).sort_index()
)
output:
month A B
2 1994-07 C 70.0
3 1994-07 D 80.0
9 1994-08 D 95.0
10 1994-08 E 100.0
11 1994-08 F 110.0
CodePudding user response:
Alternatively, you could pass median to groupby.transform, then filter values greater than the median (i.e. top 50%). Since median method skips NaN by default, no issues there.
out = df[df['B'] > df.groupby('month')['B'].transform('median')]
Output:
month A B
2 1994-07 C 70.0
3 1994-07 D 80.0
9 1994-08 D 95.0
10 1994-08 E 100.0
11 1994-08 F 110.0
