This question focus on pandas own functions. There are still solutions (pandas DataFrame: replace nan values with average of columns) but with own written functions.
In SPSS there is function MEAN.n which gives you the mean value of list of numbers only when n elements of that list are valid (not pandas.NA). With that function you are able to imputat missing values only if a minimum number of items are valid.
Are there pandas function to do this with?
Example
Values [1, 2, 3, 4, NA].
Mean of the valid values is 2.5.
The resulting list should be [1, 2, 3, 4, 2.5].
Assume the rule that in a 5 item list 3 should have valid values for imputation. Otherwise the result is NA.
Values [1, 2, NA, NA, NA].
Mean of the valid values is 1.5 but it does not matter.
The resulting list should not be changed [1, 2, NA, NA, NA] because imputation is not allowed.
CodePudding user response:
Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) to fillna with the mean only if a minimum number of items are not NA:
from pandas import NA
s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])
def fillna_mean(s, N=4):
return s if s.notna().sum() < N else s.fillna(s.mean())
fillna_mean(s1)
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# 4 2.5
# dtype: float64
fillna_mean(s2)
# 0 1
# 1 2
# 2 <NA>
# 3 <NA>
# 4 <NA>
# dtype: object
fillna_mean(s2, N=2)
# 0 1.0
# 1 2.0
# 2 1.5
# 3 1.5
# 4 1.5
# dtype: float64
CodePudding user response:
Lets try list comprehension, though it will be messy
Option1
You can use pd.Series and numpy
s= [x if np.isnan(lst).sum()>=3 else pd.Series(lst).mean(skipna=True) if x is np.nan else x for x in lst]
Option2 use numpy all through
s=[x if np.isnan(lst).sum()>=3 else np.mean([x for x in lst if str(x) != 'nan']) if x is np.nan else x for x in lst]
Case1
lst=[1, 2, 3, 4, np.nan]
outcome
[1, 2, 3, 4, 2.5]
Case2
lst=[1, 2, np.nan, np.nan, np.nan]
outcome
[1, 2, nan, nan, nan]
if you wanted it as a pd. Series, simply
pd.Series(s, name='lst')
How it works
s=[x if np.isnan(lst).sum()>=3 \ #give me element x if the sum of nans in the list is greater than or equal to 3
else pd.Series(lst).mean(skipna=True) if x is np.nan else x \# Otherwise replace the Nan in list with the mean of non NaN elements in the list
for x in lst\#For every element in lst
]
