Say I have this dummy pandas DataFrame:
  Feature1  Feature2
0        X         0
1        X         0
2        Y         0
3        Y         1
4        Y         1
5        X         1
6        Y         0
7        X         1
8        Y         1
9        X         0
How do I calculate the average of Feature2 only for rows where Feature1 is X, and then again only for rows where Feature1 is Y? I figure groupby is the way to do it, but it's not working for me.
My attempt (making a function to find the difference in the two averages):
def diff_of_avg(df, column_name, groupby_var):
    groupby_var = df.groupby(groupby_var)
    avgs = groupby_var[column_name].mean()
    return avgs.loc['1'] - avgs.loc['0']
where groupby_var is Feature2
and column_name is Feature1
CodePudding user response:
You can indeed use groupby():
df2 = df.groupby('Feature1').mean()
Output:
          Feature2
Feature1
X              0.4
Y              0.6
Docs for mean() give some examples as well.
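For a version you can paste and run, here is a minimal sketch that rebuilds the frame from the question and selects the numeric column before aggregating, which returns a Series rather than a DataFrame. It assumes the columns are named Feature1 and Feature2 exactly as in the question text.

```python
import pandas as pd

# Rebuild the example frame from the question
# (assumption: the columns are named 'Feature1' and 'Feature2').
df = pd.DataFrame({
    "Feature1": ["X", "X", "Y", "Y", "Y", "X", "Y", "X", "Y", "X"],
    "Feature2": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
})

# Group by the label column, then average only the numeric column.
# The result is a Series indexed by the Feature1 labels.
avgs = df.groupby("Feature1")["Feature2"].mean()
print(avgs)
```

Selecting the column explicitly keeps the result predictable if the frame later gains other columns, including non-numeric ones.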
To find the difference in the averages of X and Y, you can do this:
diffOfAverages = df.groupby('Feature1').mean().diff().iloc[-1,-1]
Output:
0.19999999999999996
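As for why the function in the question fails: with groupby_var set to Feature2 and column_name set to Feature1, it groups by the numeric column and tries to average the string column, which pandas cannot do. It also looks up '1' and '0' as strings, while the group labels in Feature2 are the integers 1 and 0. Below is one possible rewrite, not the only way to do it. The extra parameters first and second are hypothetical additions so the two group labels are passed in explicitly instead of being hard-coded.

```python
import pandas as pd

def diff_of_avg(df, column_name, groupby_var, first, second):
    """Mean of column_name in group `second` minus its mean in group `first`.

    `first` and `second` are hypothetical parameters added for illustration;
    they must match the group labels exactly, including their type.
    """
    avgs = df.groupby(groupby_var)[column_name].mean()
    return avgs.loc[second] - avgs.loc[first]

# Example frame from the question (column names assumed as written there).
df = pd.DataFrame({
    "Feature1": ["X", "X", "Y", "Y", "Y", "X", "Y", "X", "Y", "X"],
    "Feature2": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
})

# Average Feature2 within each Feature1 group, then take Y minus X.
result = diff_of_avg(df, "Feature2", "Feature1", "X", "Y")
print(result)  # approximately 0.2, subject to floating-point rounding
```

Note that the arguments are swapped relative to the question: the column being averaged is Feature2 and the grouping column is Feature1.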
