Example Data
| ID | Name | Phone |
|---|---|---|
| 1 | x | 212 |
| 2 | y | NaN |
| 3 | xy | NaN |
df is the name of the dataset. The code below returns the names of the columns that have no missing values.
no_nulls = set(df.columns[df.isnull().mean()==0])
isnull() converts the dataset into something like this:
| ID | Name | Phone |
|---|---|---|
| False | False | False |
| False | False | True |
| False | False | True |
Can someone explain how mean() works on non-integer values?
I also tried the following and it worked, but I am curious about mean():
no_nulls = set(df.columns[df.notnull().all()])
CodePudding user response:
In your case, .mean() is processing a DataFrame that contains only the boolean values True and False. In that situation, .mean() treats False as 0 and True as 1. Hence, if you look at the result of df.isnull().mean(), you will see:
df.isnull().mean()
ID 0.000000
Name 0.000000
Phone 0.666667
dtype: float64
Columns ID and Name contain only False values, so .mean() treats every entry as zero and returns a mean of 0. Column Phone has one False and two True values, so its mean is the mean of 0, 1, 1, i.e. 0.666667. In other words, the mean of a boolean column is the fraction of its values that are True, which here is the fraction of missing values.
As a result, df.isnull().mean()==0 is True only for the first two columns, and no_nulls ends up as {'ID', 'Name'}.
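To make this concrete, here is a minimal sketch that rebuilds the example table and runs both approaches from the question. The values come from the example data; the variable names null_mask, null_fraction and no_nulls_alt are just illustrative choices.

```python
import numpy as np
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["x", "y", "xy"],
    "Phone": [212, np.nan, np.nan],
})

# Boolean DataFrame: True where a value is missing.
null_mask = df.isnull()

# True counts as 1 and False as 0, so this is the fraction of missing values per column.
null_fraction = null_mask.mean()

# Columns whose fraction of missing values is exactly zero.
no_nulls = set(df.columns[null_fraction == 0])

# The alternative from the question: keep columns where every value is present.
no_nulls_alt = set(df.columns[df.notnull().all()])

print(null_fraction)
print(no_nulls, no_nulls_alt)
```

Both expressions select the same columns for this data. The notnull().all() form arguably states the intent more directly, since it does not rely on comparing a float to zero.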
The official documentation of DataFrame.mean gives a hint in its description of the numeric_only= parameter. In the pandas version this answer was written against, the parameter and its default behavior are described as follows:
Parameters
numeric_only bool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
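The point that boolean columns count as numeric can be checked with a small sketch like the one below. It assumes the same example data as the question; the names mask, by_default, restricted and manual are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["x", "y", "xy"],
    "Phone": [212, np.nan, np.nan],
})

mask = df.isnull()

# Every column of the mask is boolean, even though Name holds strings in df.
all_bool = all(pd.api.types.is_bool_dtype(t) for t in mask.dtypes)

# Boolean columns are kept when restricting to numeric data,
# so both calls should average the same three columns.
by_default = mask.mean()
restricted = mask.mean(numeric_only=True)

# The same number computed by hand: count of True values divided by row count.
manual = mask.sum() / len(mask)

print(all_bool)
print(by_default, restricted, manual, sep="\n\n")
```

Because isnull() always returns booleans, the original dtypes of the columns (integers, strings, floats) no longer matter by the time .mean() is called.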
