Apply Chi-Square to a dataset which contains categorical variables

My dataset has the following columns:

Voted?         Political Category
Yes            Right
No             Left
Not Answered   Center
Yes            Right
Yes            Right
No             Right

I need to compute a chi-square test to see which category is most strongly associated with people who voted. Both columns contain strings. How can I give each value a numeric representation so that I can apply chi-square?

CodePudding user response:

You can use pd.factorize to encode your categorical variables:

import pandas as pd
# pd.factorize returns (codes, uniques); [0] keeps only the integer codes
df['nVoted?'] = pd.factorize(df['Voted?'])[0]
df['nCategory'] = pd.factorize(df['Political Category'])[0]
print(df)

# Output
         Voted? Political Category  nVoted?  nCategory
0           Yes              Right        0          0
1            No               Left        1          1
2  Not Answered             Center        2          2
3           Yes              Right        0          0
4           Yes              Right        0          0
5            No              Right        1          0

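After that you can use scipy.stats.chisquare, which runs a goodness-of-fit test on observed counts. Note that to test whether the two columns are associated, the usual tool is scipy.stats.chi2_contingency applied to a contingency table of counts (for example one built with pd.crosstab). That approach works directly on the string columns, so the numeric encoding is not strictly required.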
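
As a minimal sketch of the test of independence, assuming the six rows from the question are the whole dataset, the counts can be cross-tabulated and passed to scipy.stats.chi2_contingency. The variable names (table, chi2, p_value, dof, expected) are illustrative and not from the original answer. With so few rows the expected counts are very small, so the result here only illustrates the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Rebuild the sample data from the question (assumed to be the full dataset).
df = pd.DataFrame({
    "Voted?": ["Yes", "No", "Not Answered", "Yes", "Yes", "No"],
    "Political Category": ["Right", "Left", "Center", "Right", "Right", "Right"],
})

# Contingency table of counts: rows are vote answers, columns are categories.
table = pd.crosstab(df["Voted?"], df["Political Category"])

# Chi-square test of independence on the table of counts.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A small p-value would suggest that voting status and political category are not independent. To see which category drives the association, compare the observed counts in table with the expected array.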