I have a pandas dataframe (adjusted_data) containing many independent variables and a target variable called RainTomorrow. I found out how I can get the correlation between the independent variables and the target variable by using:
adjusted_data.corr()['RainTomorrow'][:].abs()
I would like to create a new dataframe (adjusted_data_narrowed) that only consists of columns where the correlation value is above a certain threshold. What is the best way to do that?
CodePudding user response:
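Filter the correlation Series by your threshold, then use its index to select the matching columns from the original dataframe. RainTomorrow itself is kept, because its correlation with itself is 1:
target_corr = adjusted_data.corr()['RainTomorrow'].abs()
adjusted_data_narrowed = adjusted_data[target_corr[target_corr > 0.05].index]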
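As a minimal sketch of the same idea on toy data (the `Humidity` and `Noise` columns and the 0.5 cut-off are made up for illustration and are not from the question), the following builds the narrowed dataframe end to end:

```python
import pandas as pd

# Hypothetical stand-in for adjusted_data; only RainTomorrow comes from the question.
adjusted_data = pd.DataFrame({
    "Humidity": [10, 20, 30, 40, 50, 60],
    "Noise": [1, -1, 1, -1, 1, -1],
    "RainTomorrow": [0, 0, 0, 1, 1, 1],
})

threshold = 0.5  # assumed cut-off; choose a value that suits your data

# Absolute correlation of every column with the target.
target_corr = adjusted_data.corr()["RainTomorrow"].abs()

# Labels of the columns whose absolute correlation exceeds the threshold.
# The target has correlation 1.0 with itself, so it stays in the result.
keep = target_corr[target_corr > threshold].index

adjusted_data_narrowed = adjusted_data[keep]
print(list(adjusted_data_narrowed.columns))
```

If you do not want the target in the result, dropping it from `keep` first (for example with `keep.drop("RainTomorrow")`) should work.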
CodePudding user response:
The following should work. It might not be the best or most convenient solution, but it will do for now. Note that it compares every pair of columns with each other, not only each column against RainTomorrow. Set threshold to the absolute correlation you want. For example, if you only want columns where the correlation is higher than 0.5 or lower than -0.5, set threshold to 0.5.
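from itertools import combinations

threshold = 0.5  # example value; use the absolute correlation you want
corr = adjusted_data.corr()
passed = set()
# Check every pair of columns and keep both when they are correlated strongly enough
for r, c in combinations(corr.columns, 2):
    if abs(corr.loc[r, c]) >= threshold:
        passed.add(r)
        passed.add(c)
passed = sorted(passed)
corr = corr.loc[passed, passed]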
corr is now the correlation matrix restricted to the columns that met the threshold, so any column missing from it did not qualify. You can now filter your dataframe with:
adjusted_data_narrowed = adjusted_data[corr.columns]
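One thing to be aware of with the pairwise approach: a column passes if it is strongly correlated with any other column, which is not the same as being correlated with RainTomorrow. The sketch below is a vectorized variant of the pairwise check on made-up data (columns `A`, `B`, `C` and the 0.5 threshold are assumptions for illustration) and shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical data: A and B track each other closely, but neither
# is related to the target.
adjusted_data = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 13],
    "C": [1, -1, 1, -1, 1, -1],
    "RainTomorrow": [1, 0, 0, 0, 0, 1],
})

threshold = 0.5  # assumed cut-off

corr = adjusted_data.corr().abs()

# Replace the diagonal with NaN so a column's correlation with itself is ignored.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))

# A column passes if it reaches the threshold against at least one other column.
has_strong_pair = (off_diag >= threshold).any()
passed = sorted(has_strong_pair[has_strong_pair].index)

print(passed)
```

Here `A` and `B` pass because they are correlated with each other, while `RainTomorrow` is dropped. If the goal is to select features by their relationship to the target, filtering on the `RainTomorrow` column of the correlation matrix is probably the better fit.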
