I have a dataset like this one
data = {'DATE': ['2012-01-01', '2013-05-16', '2013-05-28',
'2013-06-05', '2013-06-07', '2014-02-02'],
'Avarage_Temp': [21.8, 21.1, 22.8, 23.3, 14.4, np.nan],
'Minimun_Temp': [14.4, np.nan, np.nan, np.nan, 15.6, 18.3],
'Maximum_Temp': [6.7, 14.4, 11.7, 16.1, np.nan, 10.0]}
When I execute the command
data['Avarage_Temp'] = data[['Minimun_Temp', 'Maximum_Temp']].mean(axis=1).round(1)
I get this output:
data = {'DATE': ['2012-01-01', '2013-05-16', '2013-05-28',
'2013-06-05', '2013-06-07', '2014-02-02'],
'Avarage_Temp': [10.6, 14.4, 11.7, 16.1, 15.6, 14.1],
'Minimun_Temp': [14.4, np.nan, np.nan, np.nan, 15.6, 18.3],
'Maximum_Temp': [6.7, 14.4, 11.7, 16.1, np.nan, 10.0]}
As you can see in rows 2-5 the result is wrong, because in these rows it return the value of the non missing value(if Minimun_Temp is na, gives the value of Maximum_Temp in the Avarage_Temp). So I want to have as output something like the 1st and 6th row. I want to change the value of Avarage_Temp only if the column Maximum_Temp and Minimun_Temp in this row aren't na's. If Maximum_Temp or Minimun_Temp is na, I want to keep the value of Avarage_Temp as it is.
CodePudding user response:
You can create NaNs if at least one value in Minimun_Temp, Maximum_Temp by parameter skipna=False in DataFrame.mean and then replace them by column Avarage_Temp:
df = pd.DataFrame(data)
df['Avarage_Temp'] = (df[['Minimun_Temp', 'Maximum_Temp']].mean(axis=1, skipna=False)
.round(1)
.fillna(df['Avarage_Temp']))
print (df)
DATE Avarage_Temp Minimun_Temp Maximum_Temp
0 2012-01-01 10.6 14.4 6.7
1 2013-05-16 21.1 NaN 14.4
2 2013-05-28 22.8 NaN 11.7
3 2013-06-05 23.3 NaN 16.1
4 2013-06-07 14.4 15.6 NaN
5 2014-02-02 14.2 18.3 10.0
Another idea with Series.add for NaNs if at least one NaN per rows and for mean divide by 2:
df['Avarage_Temp'] = (df['Minimun_Temp'].add(df['Maximum_Temp'])
.div(2)
.round(1)
.fillna(df['Avarage_Temp']))
print (df)
DATE Avarage_Temp Minimun_Temp Maximum_Temp
0 2012-01-01 10.6 14.4 6.7
1 2013-05-16 21.1 NaN 14.4
2 2013-05-28 22.8 NaN 11.7
3 2013-06-05 23.3 NaN 16.1
4 2013-06-07 14.4 15.6 NaN
5 2014-02-02 14.2 18.3 10.0
CodePudding user response:
IIUC, you want to keep the original "Avarage_Temp" if either "Minimun_Temp" or "Maximum_Temp" is NaN for that row, right? Then you can use isna any on axis to create a boolean Series that returns True if either of those are NaN for each row. Then use it as a condition in np.where and assign values to "Avarage_Temp": if NaN exists in a row, keep original, select the mean otherwise.
import numpy as np
cols = ['Minimun_Temp', 'Maximum_Temp']
data['Avarage_Temp'] = np.where(data[cols].isna().any(axis=1), data['Avarage_Temp'], data[cols].mean(axis=1).round(1))
Another option is to use skipna parameter when using mean method and use the averages as a condition in np.where:
avgs = data[['Minimun_Temp', 'Maximum_Temp']].mean(axis=1, skipna=False).round(1)
data['Avarage_Temp'] = np.where(avgs.isna(), data['Avarage_Temp'], avgs)
Output:
DATE Avarage_Temp Minimun_Temp Maximum_Temp
0 2012-01-01 10.6 14.4 6.7
1 2013-05-16 21.1 NaN 14.4
2 2013-05-28 22.8 NaN 11.7
3 2013-06-05 23.3 NaN 16.1
4 2013-06-07 14.4 15.6 NaN
5 2014-02-02 14.2 18.3 10.0
