Pandas: top N% of values within each group while ignoring missing values

I have a df and, for each month, I want the rows holding the highest 50% of the values in column "B" for that month.

month	A	B
1994-07	A	50
1994-07	B	60
1994-07	C	70
1994-07	D	80
1994-07	E	NAN
1994-07	F	NAN
1994-08	A	90
1994-08	B	60
1994-08	C	70
1994-08	D	95
1994-08	E	100
1994-08	F	110
1994-08	G	NAN

For July 1994, only 4 rows have a value in "B", so 50% means the 2 highest values. For the following month, 6 rows have a value, which gives the 3 highest values:

month	A	B
1994-07	C	70
1994-07	D	80
1994-08	D	95
1994-08	E	100
1994-08	F	110

I have tried:

df = df.groupby(pd.Grouper(freq="M")).apply(lambda g: g.nlargest(len(g)//2, 'B'))

However, len(g) does not ignore "NAN"; it counts those rows like any other. For example, for July 1994 it counts 6 rows, so the code returns the 3 (50% of 6) highest values for the month. Instead, it should count the 4 rows that actually have a value and return the 2 highest.
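The miscount can be seen by comparing the number of rows per group with the number of non-missing values per group. A minimal sketch, with the sample data rebuilt by hand (storing "month" as plain strings and "NAN" as np.nan is an assumption about the real df):

```python
import numpy as np
import pandas as pd

# Sample data from the question; "month" kept as strings (assumption).
df = pd.DataFrame({
    "month": ["1994-07"] * 6 + ["1994-08"] * 7,
    "A": list("ABCDEF") + list("ABCDEFG"),
    "B": [50, 60, 70, 80, np.nan, np.nan, 90, 60, 70, 95, 100, 110, np.nan],
})

# "size" counts every row in the group, like len(g);
# "count" counts only the non-NaN values of B.
summary = df.groupby("month")["B"].agg(["size", "count"])
print(summary)
```

Halving the "size" column gives 3 rows for both months, while halving "count" gives the 2 and 3 rows described above.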

CodePudding user response：

IIUC, you want to use g['B'].count() instead of len(g), as count ignores NaNs:

(df
 .groupby('month')
 .apply(lambda g: g.nlargest(g['B'].count()//2, columns='B'))
 .droplevel(0).sort_index()
)

Alternatively, dropna first:

(df
 .dropna(subset=['B'])  # a list also works on pandas versions that reject a bare string here
 .groupby('month')
 .apply(lambda g: g.nlargest(len(g)//2, 'B'))
 .droplevel(0).sort_index()
)

output:

      month  A      B
2   1994-07  C   70.0
3   1994-07  D   80.0
9   1994-08  D   95.0
10  1994-08  E  100.0
11  1994-08  F  110.0
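Since the title asks about the top N% in general, the same idea could be wrapped in a small helper that takes the fraction as a parameter. This is only a sketch: top_fraction and its frac/col parameters are hypothetical names, not part of pandas, and the sample data is rebuilt by hand with "month" stored as strings. It loops over the groups and concatenates the pieces rather than using apply, which sidesteps differences between pandas versions in how apply treats the grouping column.

```python
import numpy as np
import pandas as pd

# Sample data from the question; "month" kept as strings (assumption).
df = pd.DataFrame({
    "month": ["1994-07"] * 6 + ["1994-08"] * 7,
    "A": list("ABCDEF") + list("ABCDEFG"),
    "B": [50, 60, 70, 80, np.nan, np.nan, 90, 60, 70, 95, 100, 110, np.nan],
})


def top_fraction(g, frac=0.5, col="B"):
    """Hypothetical helper: rows of g with the top `frac` of non-NaN values in `col`."""
    # count() ignores NaN; int() truncates, so the row count is rounded down.
    n = int(g[col].count() * frac)
    return g.nlargest(n, col)


# Apply the helper to each month and restore the original row order.
parts = [top_fraction(g, frac=0.5) for _, g in df.groupby("month")]
out = pd.concat(parts).sort_index()
print(out)
```

With frac=0.5 this should select the same five rows as the output above; other fractions only change how many rows nlargest keeps per month.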

CodePudding user response：

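Alternatively, you could pass 'median' to groupby.transform, then keep the rows whose value is greater than their group's median (i.e. the top 50%, provided no value ties with the median). The median skips NaN by default, and a NaN compared with > is False, so the rows with missing values drop out as well.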

out = df[df['B'] > df.groupby('month')['B'].transform('median')]

Output:

      month  A      B
2   1994-07  C   70.0
3   1994-07  D   80.0
9   1994-08  D   95.0
10  1994-08  E  100.0
11  1994-08  F  110.0
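One case where the median filter and the nlargest approach can disagree is a tie at the median. The sketch below uses made-up values (not the question's data) to show it: the strict > comparison drops every row equal to the median, while nlargest still returns count // 2 rows.

```python
import pandas as pd

# Made-up single-month data with two values tied at the median.
tied = pd.DataFrame({
    "month": ["1994-07"] * 4,
    "A": list("ABCD"),
    "B": [50, 70, 70, 80],
})

# Median filter: the group median is 70, so only values strictly above 70 survive.
median = tied.groupby("month")["B"].transform("median")
by_median = tied[tied["B"] > median]

# nlargest with a NaN-aware count: always keeps count // 2 rows.
by_nlargest = tied.nlargest(tied["B"].count() // 2, "B")

print(len(by_median), len(by_nlargest))
```

Which behaviour is preferable depends on whether a fixed number of rows per month or a strict cut-off value matters more.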