I am trying to find the mean of a column in Pandas. The rows from which the mean will be calculated, need to be based on duplicate values in another column. In the example below, I want to populate the empty "mean_humidity" column with the mean calculated from the "humidity" column for each individual date based on the "date" column. For example, the first 3 rows in the "mean_humidity" column will be the mean calculated from the first 3 rows in the "humidity" column. Thanks!
>>> import pandas as pd
>>> dates = ['1/1/2020', '1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020', '1/2/2020']
>>> humidity = [11, 22, 33, 44, 55, 66]
>>> df = pd.DataFrame(list(zip(dates, humidity)),
... columns =['dates', 'humidity'])
>>> df["mean_humidity"] = ""
>>> df
dates humidity mean_humidity
0 1/1/2020 11
1 1/1/2020 22
2 1/1/2020 33
3 1/2/2020 44
4 1/2/2020 55
5 1/2/2020 66
CodePudding user response:
by group by :
df["mean_humidity"] = df.groupby('dates')['humidity'].transform('mean')
print(df)
output:
>>>
dates humidity mean_humidity
0 1/1/2020 11 22.0
1 1/1/2020 22 22.0
2 1/1/2020 33 22.0
3 1/2/2020 44 55.0
4 1/2/2020 55 55.0
5 1/2/2020 66 55.0
CodePudding user response:
You could use:
df["mean_humidity"] = df["humidity"].groupby(df["dates"]).transform("mean")
Output:
dates humidity mean_humidity
0 1/1/2020 11 22.0
1 1/1/2020 22 22.0
2 1/1/2020 33 22.0
3 1/2/2020 44 55.0
4 1/2/2020 55 55.0
5 1/2/2020 66 55.0
