When I run the following code in Jupyter Lab:
import numpy as np
from sklearn.feature_selection import SelectKBest,f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
selector = SelectKBest(f_classif,k=5)
selector.fit(titanic[predictors],titanic["Survived"])
It fails with ValueError: could not convert string to float: 'Mme'. The traceback is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_17760/1637555559.py in <module>
5 predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
6 selector = SelectKBest(f_classif,k=5)
----> 7 selector.fit(titanic[predictors],titanic["Survived"])
......
ValueError: could not convert string to float: 'Mme'
I printed titanic[predictors] and titanic["Survived"], and the output is as follows:
Pclass Sex Age SibSp Parch Fare Embarked FamilySize Title NameLength
0 3 0 22.0 1 0 7.2500 0 1 1 23
1 1 1 38.0 1 0 71.2833 1 1 3 51
2 3 1 26.0 0 0 7.9250 0 0 2 22
3 1 1 35.0 1 0 53.1000 0 1 3 44
4 3 0 35.0 0 0 8.0500 0 0 1 24
... ... ... ... ... ... ... ... ... ... ...
886 2 0 27.0 0 0 13.0000 0 0 6 21
887 1 1 19.0 0 0 30.0000 0 0 2 28
888 3 1 28.0 1 2 23.4500 0 3 2 40
889 1 0 26.0 0 0 30.0000 1 0 1 21
890 3 0 32.0 0 0 7.7500 2 0 1 19
891 rows × 10 columns
0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: Survived, Length: 891, dtype: int64
How can I solve this problem?
CodePudding user response:
Is it printing the column labels in the first line? If so, the header row is being read as data, so assign the array starting from the second row, array[1:,:]. Otherwise, find where the string "Mme" is located so you understand how the code is picking it up.
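One way to find where "Mme" is located is sketched below. The printed preview in the question shows only the first and last rows, so a leftover string in the middle of a column would not be visible there. The small titanic frame here is a hypothetical stand-in (its values are made up for illustration), and it assumes the problem is an unmapped string left in an otherwise numeric column.

```python
import pandas as pd

# Hypothetical stand-in for the question's `titanic` frame: Title is mostly
# mapped to integers, but one unmapped string is left over.
titanic = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Title": [1, 3, "Mme", 2],
    "NameLength": [23, 51, 22, 44],
})

# Columns with dtype `object` are the ones that may still contain strings.
object_columns = titanic.select_dtypes(include="object").columns.tolist()

# Coerce each such column to numbers; values that cannot be converted
# become NaN, which marks the offending rows.
offending = {}
for column in object_columns:
    converted = pd.to_numeric(titanic[column], errors="coerce")
    bad_mask = converted.isna() & titanic[column].notna()
    offending[column] = titanic.loc[bad_mask, column]

for column, values in offending.items():
    print(column, values.to_dict())
```

Running the same loop on the real frame should list the row indices and raw values that still need to be mapped or encoded.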
CodePudding user response:
When you fit an estimator (in your case SelectKBest), you need to know your data, and you almost always need to preprocess it first.
Take a look at your data:
- Do you have categorical features, numerical features, or a mix?
- Do you have NaN values?
- ...
Most algorithms don't accept categorical features, so you need to convert them to numerical ones (consider using OneHotEncoder).
You will have the same problem with NaN values.
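A minimal sketch of such preprocessing, assuming a median fill for missing numeric values and one-hot encoding for a string Title column. The data, the column split, and k=3 are made up for illustration and are not taken from the question's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the Titanic data, not the real dataset.
titanic = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "Age": [22.0, 38.0, np.nan, 35.0, 35.0, 27.0, 19.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 30.00, 23.45],
    "Title": ["Mr", "Mrs", "Miss", "Mrs", "Mr", "Mme", "Miss", "Mr"],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

numeric_columns = ["Pclass", "Age", "Fare"]
categorical_columns = ["Title"]

# Fill NaN in numeric columns and turn each Title value into a 0/1 column.
# sparse_threshold=0 forces a dense array as output.
preprocess = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), numeric_columns),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ],
    sparse_threshold=0,
)

# Preprocessing runs first, so SelectKBest only ever sees numbers.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=3)),
])

features = titanic.drop(columns="Survived")
pipeline.fit(features, titanic["Survived"])
selected = pipeline.transform(features)
print(selected.shape)
```

Putting the steps in a Pipeline keeps the encoding and the feature selection together, so the same transformations are applied whenever the selector is used on new data.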
In conclusion, before you start fitting, you have to preprocess your data.
