Replace na's in pandas dataframe with the mean value of two column if they don't contain n-CodePudding

I have a dataset like this one

data = {'DATE': ['2012-01-01', '2013-05-16', '2013-05-28',
               '2013-06-05', '2013-06-07', '2014-02-02'],
               'Avarage_Temp': [21.8, 21.1, 22.8, 23.3, 14.4, np.nan],
               'Minimun_Temp': [14.4, np.nan, np.nan, np.nan, 15.6, 18.3],
               'Maximum_Temp': [6.7, 14.4, 11.7, 16.1, np.nan, 10.0]}

When I execute the command

data['Avarage_Temp'] = data[['Minimun_Temp', 'Maximum_Temp']].mean(axis=1).round(1)

I get this output:

data = {'DATE': ['2012-01-01', '2013-05-16', '2013-05-28',
               '2013-06-05', '2013-06-07', '2014-02-02'],
               'Avarage_Temp': [10.6, 14.4, 11.7, 16.1, 15.6, 14.1],
               'Minimun_Temp': [14.4, np.nan, np.nan, np.nan, 15.6, 18.3],
               'Maximum_Temp': [6.7, 14.4, 11.7, 16.1, np.nan, 10.0]}

As you can see in rows 2-5 the result is wrong, because in these rows it return the value of the non missing value(if Minimun_Temp is na, gives the value of Maximum_Temp in the Avarage_Temp). So I want to have as output something like the 1st and 6th row. I want to change the value of Avarage_Temp only if the column Maximum_Temp and Minimun_Temp in this row aren't na's. If Maximum_Temp or Minimun_Temp is na, I want to keep the value of Avarage_Temp as it is.

CodePudding user response：

You can create NaNs if at least one value in Minimun_Temp, Maximum_Temp by parameter skipna=False in DataFrame.mean and then replace them by column Avarage_Temp:

df = pd.DataFrame(data)

df['Avarage_Temp'] = (df[['Minimun_Temp', 'Maximum_Temp']].mean(axis=1, skipna=False)
                                                          .round(1)
                                                          .fillna(df['Avarage_Temp']))
print (df)
         DATE  Avarage_Temp  Minimun_Temp  Maximum_Temp
0  2012-01-01          10.6          14.4           6.7
1  2013-05-16          21.1           NaN          14.4
2  2013-05-28          22.8           NaN          11.7
3  2013-06-05          23.3           NaN          16.1
4  2013-06-07          14.4          15.6           NaN
5  2014-02-02          14.2          18.3          10.0

Another idea with Series.add for NaNs if at least one NaN per rows and for mean divide by 2:

df['Avarage_Temp'] = (df['Minimun_Temp'].add(df['Maximum_Temp'])
                                        .div(2)
                                        .round(1)
                                        .fillna(df['Avarage_Temp']))
print (df)
         DATE  Avarage_Temp  Minimun_Temp  Maximum_Temp
0  2012-01-01          10.6          14.4           6.7
1  2013-05-16          21.1           NaN          14.4
2  2013-05-28          22.8           NaN          11.7
3  2013-06-05          23.3           NaN          16.1
4  2013-06-07          14.4          15.6           NaN
5  2014-02-02          14.2          18.3          10.0

CodePudding user response：

IIUC, you want to keep the original "Avarage_Temp" if either "Minimun_Temp" or "Maximum_Temp" is NaN for that row, right? Then you can use isna any on axis to create a boolean Series that returns True if either of those are NaN for each row. Then use it as a condition in np.where and assign values to "Avarage_Temp": if NaN exists in a row, keep original, select the mean otherwise.

import numpy as np
cols = ['Minimun_Temp', 'Maximum_Temp']
data['Avarage_Temp'] = np.where(data[cols].isna().any(axis=1), data['Avarage_Temp'], data[cols].mean(axis=1).round(1))

Another option is to use skipna parameter when using mean method and use the averages as a condition in np.where:

avgs = data[['Minimun_Temp', 'Maximum_Temp']].mean(axis=1, skipna=False).round(1)
data['Avarage_Temp'] = np.where(avgs.isna(), data['Avarage_Temp'], avgs)

Output:

         DATE  Avarage_Temp  Minimun_Temp  Maximum_Temp
0  2012-01-01          10.6          14.4           6.7
1  2013-05-16          21.1           NaN          14.4
2  2013-05-28          22.8           NaN          11.7
3  2013-06-05          23.3           NaN          16.1
4  2013-06-07          14.4          15.6           NaN
5  2014-02-02          14.2          18.3          10.0