My dataset has the following columns:
Voted?        Political Category
Yes           Right
No            Left
Not Answered  Center
Yes           Right
Yes           Right
No            Right
I need to run a chi-square test to see which political category is most strongly associated with people who voted. Both columns contain strings. How can I give each value a numeric representation so that I can apply the chi-square test?
CodePudding user response:
You can use pd.factorize to encode your categorical variables:
import pandas as pd
# factorize returns (codes, uniques); [0] keeps only the integer codes
df['nVoted?'] = pd.factorize(df['Voted?'])[0]
df['nCategory'] = pd.factorize(df['Political Category'])[0]
print(df)
# Output
         Voted? Political Category  nVoted?  nCategory
0           Yes              Right        0          0
1            No               Left        1          1
2  Not Answered             Center        2          2
3           Yes              Right        0          0
4           Yes              Right        0          0
5            No              Right        1          0
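After that you can run the test with SciPy. Note that scipy.stats.chisquare is a one-way goodness-of-fit test on a single set of counts. To test the association between two categorical columns, the usual choice is scipy.stats.chi2_contingency applied to a contingency table of counts (for example one built with pd.crosstab), which also works directly on the original string columns.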
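As a rough sketch of testing the association between the two columns, assuming the six sample rows from the question: pd.crosstab builds the table of counts and scipy.stats.chi2_contingency runs the test of independence on it. The variable names (df, table, chi2, p, dof, expected) are only illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Sample rows copied from the question (assumed to be representative)
df = pd.DataFrame({
    'Voted?': ['Yes', 'No', 'Not Answered', 'Yes', 'Yes', 'No'],
    'Political Category': ['Right', 'Left', 'Center', 'Right', 'Right', 'Right'],
})

# Contingency table: one row per 'Voted?' value, one column per category
table = pd.crosstab(df['Voted?'], df['Political Category'])
print(table)

# Chi-square test of independence on the observed counts
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")

# Expected counts under independence, labelled like the observed table
print(pd.DataFrame(expected, index=table.index, columns=table.columns))
```

With only six rows the expected counts are far too small for the chi-square approximation to be reliable, so this only shows the mechanics. On the full dataset, comparing the observed and expected tables cell by cell is one way to see which category is over-represented among voters.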
