Conditional imputation with the average of non-missing columns using the pandas toolbox

This question focuses on pandas' own functions. There are already solutions (pandas DataFrame: replace nan values with average of columns), but they rely on self-written functions.

In SPSS there is a function MEAN.n, which gives you the mean of a list of numbers only when at least n elements of that list are valid (not pandas.NA). With that function you can impute missing values only if a minimum number of items are valid.

Is there a pandas function that does this?

Example

Values [1, 2, 3, 4, NA]. Mean of the valid values is 2.5. The resulting list should be [1, 2, 3, 4, 2.5].

Assume the rule that in a 5-item list at least 3 items must have valid values for imputation to be allowed. Otherwise the missing values stay NA.

Values [1, 2, NA, NA, NA]. Mean of the valid values is 1.5, but it does not matter. The resulting list should remain unchanged, [1, 2, NA, NA, NA], because imputation is not allowed.
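For reference, the MEAN.n behaviour described above could be sketched in pandas roughly as follows. This is only an illustration of the intended semantics, not SPSS's implementation; the helper name mean_n is hypothetical.

```python
import numpy as np
import pandas as pd

def mean_n(values, n):
    """Return the mean of the valid values, or NaN if fewer than n are valid."""
    s = pd.Series(values, dtype="float64")
    # Series.count() counts the non-missing entries
    return s.mean() if s.count() >= n else np.nan

print(mean_n([1, 2, 3, 4, np.nan], 3))            # 4 valid values, so the mean is returned
print(mean_n([1, 2, np.nan, np.nan, np.nan], 3))  # only 2 valid values, so NaN is returned
```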

CodePudding user response：

Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) to fillna with the mean only if a minimum number of items are not NA:

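import pandas as pd
from pandas import NA

s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])

def fillna_mean(s, N=4):
    # Fill NA with the mean only if at least N values are valid; otherwise return s unchanged
    return s if s.notna().sum() < N else s.fillna(s.mean())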

fillna_mean(s1)
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    2.5
# dtype: float64

fillna_mean(s2)
# 0       1
# 1       2
# 2    <NA>
# 3    <NA>
# 4    <NA>
# dtype: object

fillna_mean(s2, N=2)
# 0    1.0
# 1    2.0
# 2    1.5
# 3    1.5
# 4    1.5
# dtype: float64
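Since the question title mentions columns, the same idea can also be applied row-wise to a whole DataFrame using only built-in pandas methods. The following is a minimal sketch under assumptions: the data, the column names and the min_valid threshold are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: each row is one case, each column one item
df = pd.DataFrame(
    {
        "a": [1.0, 1.0],
        "b": [2.0, 2.0],
        "c": [3.0, np.nan],
        "d": [4.0, np.nan],
        "e": [np.nan, np.nan],
    }
)

min_valid = 3  # assumed threshold, analogous to the n in MEAN.n

row_mean = df.mean(axis=1)                    # per-row mean, missing values skipped
enough = df.notna().sum(axis=1) >= min_valid  # rows that meet the threshold

# Rows below the threshold get NaN as their fill value, so they stay unchanged
fill_values = row_mean.where(enough)

# Fill every column's gaps with the fill value of the matching row (aligned on the index)
imputed = df.apply(lambda col: col.fillna(fill_values))
print(imputed)
```

With this data, the first row becomes 1, 2, 3, 4, 2.5, while the second row keeps its three missing values because only two items are valid.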

CodePudding user response：

Let's try a list comprehension, though it will be messy.

Option 1

You can use pd.Series and numpy (this needs import numpy as np and import pandas as pd):

  s = [x if np.isnan(lst).sum() >= 3 else pd.Series(lst).mean(skipna=True) if np.isnan(x) else x for x in lst]

Option 2: use numpy throughout

  s = [x if np.isnan(lst).sum() >= 3 else np.nanmean(lst) if np.isnan(x) else x for x in lst]

Case 1

lst=[1, 2, 3, 4, np.nan]

outcome

[1, 2, 3, 4, 2.5]

Case 2

lst=[1, 2, np.nan, np.nan, np.nan]

outcome

[1, 2, nan, nan, nan]

If you want the result as a pd.Series, simply use:

pd.Series(s, name='lst')

How it works

s = [
    x if np.isnan(lst).sum() >= 3  # keep element x as it is if the list contains 3 or more NaNs
    else pd.Series(lst).mean(skipna=True) if np.isnan(x) else x  # otherwise replace a NaN with the mean of the non-NaN elements
    for x in lst  # for every element in lst
]
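If the list comprehension feels too dense, the same logic could be written as a small vectorized numpy function. This is just a sketch of an alternative; the function name impute_mean and the max_nan parameter are hypothetical, and the result always contains floats because the list is converted to a float array.

```python
import numpy as np

def impute_mean(lst, max_nan=2):
    """Replace NaNs with the mean of the valid values if at most max_nan are missing."""
    arr = np.asarray(lst, dtype=float)
    nan_mask = np.isnan(arr)
    # Too many missing values: return the data unchanged
    if nan_mask.sum() > max_nan:
        return arr.tolist()
    # np.nanmean ignores NaNs; np.where puts the mean only where a NaN was
    return np.where(nan_mask, np.nanmean(arr), arr).tolist()

print(impute_mean([1, 2, 3, 4, np.nan]))            # one NaN, so it is replaced by the mean
print(impute_mean([1, 2, np.nan, np.nan, np.nan]))  # three NaNs, so nothing is imputed
```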