Is there a Python function to get columns according to NaN percentage?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.8,0.2),size=(10,10)))
print(df)

     0    1    2   3    4    5    6   7    8   9
0  NaN  NaN  1.0 NaN  NaN  NaN  NaN NaN  NaN NaN
1  NaN  NaN  NaN NaN  NaN  NaN  NaN NaN  1.0 NaN
2  NaN  NaN  NaN NaN  NaN  NaN  NaN NaN  1.0 NaN
3  1.0  1.0  NaN NaN  NaN  1.0  NaN NaN  1.0 NaN
4  NaN  NaN  NaN NaN  NaN  NaN  1.0 NaN  NaN NaN
5  NaN  NaN  1.0 NaN  NaN  NaN  NaN NaN  NaN NaN
6  1.0  NaN  NaN NaN  1.0  NaN  NaN NaN  1.0 NaN
7  NaN  NaN  NaN NaN  1.0  NaN  1.0 NaN  NaN NaN
8  1.0  NaN  NaN NaN  NaN  NaN  1.0 NaN  NaN NaN
9  NaN  NaN  NaN NaN  1.0  NaN  NaN NaN  NaN NaN

For the above DataFrame, what's the simplest code to get the names of the columns whose NaN percentage is below a given threshold?

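I can get a list of column names with the code below. Since thresh counts non-NaN values, it actually keeps the columns that are at least 30% non-NaN (at most 70% NaN):

# thresh is the minimum number of non-NaN values a column needs to be kept.
# how='all' is omitted: pandas 2.0+ raises a TypeError if how and thresh are both set.
col_list = df.dropna(thresh=int(df.shape[0] * 0.3), axis=1).columns.to_list()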

col_list

[0, 4, 6, 8]

What's the simplest code to get such column names?
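
As a small illustration of what the `thresh` argument of `dropna` does, here is a minimal sketch on a hand-built frame. The names `example`, `min_filled`, and `kept` are made up for this illustration and do not appear in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame with a known number of non-NaN values per column.
example = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # 4 non-NaN values
    "b": [1.0, np.nan, 3.0, 4.0],        # 3 non-NaN values
    "c": [np.nan, np.nan, 3.0, 4.0],     # 2 non-NaN values
    "d": [np.nan, np.nan, np.nan, 4.0],  # 1 non-NaN value
})

# thresh is a count of non-NaN values, not a NaN percentage:
# keep every column that has at least 2 non-NaN values (50% of 4 rows).
min_filled = 2
kept = example.dropna(axis=1, thresh=min_filled).columns.tolist()
print(kept)  # ['a', 'b', 'c']
```

Column "d" is dropped because it has only one non-NaN value, so a threshold expressed as "at most X% NaN" has to be converted into a minimum count of non-NaN values before it is passed to `thresh`.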

CodePudding user response：

You can do

df.isna().mean().loc[lambda x: x < 0.3]
Out[59]: 
1    0.1
6    0.2
7    0.0
8    0.2
dtype: float64
# Note: this output comes from a different random draw than the df printed in the question.
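# For the column names only, append .index:
# df.isna().mean().loc[lambda x: x < 0.3].index
# (df.notna().mean() gives the non-NaN fraction instead.)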
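
The same idea can be wrapped in a small helper. This is only a sketch: the function name `columns_below_nan_fraction` and the frame `sample` are hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd


def columns_below_nan_fraction(frame, threshold):
    """Return the names of columns whose NaN fraction is strictly below threshold."""
    # isna() gives a boolean frame; the column-wise mean is the NaN fraction.
    nan_fraction = frame.isna().mean()
    return nan_fraction[nan_fraction < threshold].index.tolist()


# Hypothetical data with known NaN fractions per column.
sample = pd.DataFrame({
    "w": [1.0, 2.0, 3.0, 4.0],              # 0% NaN
    "x": [1.0, np.nan, 3.0, 4.0],           # 25% NaN
    "y": [np.nan, np.nan, 3.0, 4.0],        # 50% NaN
    "z": [np.nan, np.nan, np.nan, np.nan],  # 100% NaN
})

selected = columns_below_nan_fraction(sample, 0.3)
print(selected)  # ['w', 'x']
```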

CodePudding user response：

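Alternative, keeping the columns whose non-NaN fraction is at least 30% (the same criterion as the dropna call in the question):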

col_list = df.columns[df.count() / df.shape[0] >= 0.3].tolist()
print(col_list)

# Output:
[0, 4, 6, 8]

CodePudding user response：

If you just want to do it via indexing, something like this should work:

df.columns[df.isna().sum() / df.shape[0] < 0.3]  # column names
df.loc[:, df.isna().sum() / df.shape[0] < 0.3]   # columns with their data

This computes the fraction of NaNs in each column, checks whether it is below the threshold, and uses the resulting boolean mask to pick either the column names or, with loc, the columns themselves.
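
The answers in this thread apply two different criteria, so the following sketch compares them on the DataFrame printed in the question. The `filled` mapping is transcribed by hand from that printout, and the variable names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Positions of the 1.0 values in the printed DataFrame: {column: [rows]}.
# Columns 3, 7 and 9 are entirely NaN, so they are not listed.
filled = {
    0: [3, 6, 8],
    1: [3],
    2: [0, 5],
    4: [6, 7, 9],
    5: [3],
    6: [4, 7, 8],
    8: [1, 2, 3, 6],
}

# Start from an all-NaN 10x10 frame and fill in the known values.
df = pd.DataFrame(np.nan, index=range(10), columns=range(10))
for col, rows in filled.items():
    df.loc[rows, col] = 1.0

# Criterion A: at least 30% of the values are non-NaN (question's code, count-based answer).
at_least_30pct_filled = df.columns[df.count() / len(df) >= 0.3].tolist()

# Criterion B: fewer than 30% of the values are NaN (isna-based answers).
below_30pct_nan = df.columns[df.isna().mean() < 0.3].tolist()

print(at_least_30pct_filled)  # [0, 4, 6, 8]
print(below_30pct_nan)        # []
```

On this data every column is at least 60% NaN, so criterion B selects nothing, while criterion A reproduces the list shown in the question. Which one is appropriate depends on whether the threshold is meant as a maximum NaN share or a minimum non-NaN share.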