How do I fix the problem below? The data is the video game sales dataset available on Kaggle.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv('vgsales.csv')
X = df.drop(columns=['NA_Sales' , 'EU_Sales' , 'JP_Sales' , 'Other_Sales' , 'Global_Sales'])
y = df.drop(columns= ['Rank' , 'Year' , 'Platform' , 'Year' , 'Genre' , 'Publisher' ])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
CodePudding user response:
The error states that a string could not be converted to float because it does not represent a number. You will probably have to validate or encode the data before fitting. Where exactly does the error occur?
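As a rough way to see which columns could trigger that conversion error, one can list the non-numeric columns of the feature frame. This is only a sketch: the rows below are made-up stand-ins that reuse the column names from vgsales.csv, not values from the real file.

```python
import pandas as pd

# Made-up stand-in rows (assumption); the real data would come from
# pd.read_csv('vgsales.csv').
df = pd.DataFrame({
    'Rank': [1, 2, 3],
    'Name': ['Game A', 'Game B', 'Game C'],
    'Platform': ['Wii', 'NES', 'DS'],
    'Year': [2006.0, 1985.0, 2008.0],
    'Genre': ['Sports', 'Platform', 'Racing'],
    'Publisher': ['Pub X', 'Pub X', 'Pub Y'],
    'Global_Sales': [3.0, 2.0, 1.0],
})

X = df.drop(columns=['Global_Sales'])

# Columns that are not numeric cannot be converted to float by the model.
non_numeric = X.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

Any column that shows up in this list has to be encoded or dropped before calling fit.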
CodePudding user response:
The exception is not the real problem. You are using ML without explaining what you actually want to achieve. DecisionTreeClassifier is a classifier: given input data, the model tries to determine which class that input belongs to.
If I load your data:
>>> X.columns # the input data (features)
['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher']
>>> y.columns # the output data (target)
['Name', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
X is not prepared for machine learning and y does not look like a target, so the data is unusable as it stands.
So what do you want to find with this dataset?
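For illustration only, here is what a prepared X and a single-column y might look like if the goal were to predict Global_Sales. That goal is an assumption, since the question does not state one. Because sales figures are continuous, this sketch swaps in DecisionTreeRegressor instead of the classifier, one-hot encodes the text columns with pd.get_dummies, and runs on made-up stand-in rows rather than the real file.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Made-up stand-in rows (assumption) using column names from vgsales.csv.
df = pd.DataFrame({
    'Platform': ['Wii', 'NES', 'Wii', 'DS', 'PS2', 'DS', 'PS2', 'NES', 'Wii', 'DS'],
    'Year': [2006, 1985, 2008, 2006, 2004, 2005, 2002, 1988, 2009, 2007],
    'Genre': ['Sports', 'Platform', 'Racing', 'Puzzle', 'Action',
              'Puzzle', 'Action', 'Platform', 'Sports', 'Racing'],
    'Global_Sales': [5.0, 4.0, 3.5, 2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.5],
})

# Features: one-hot encode the text columns so every column is numeric.
X = pd.get_dummies(df[['Platform', 'Year', 'Genre']],
                   columns=['Platform', 'Genre'], dtype=float)
# Target: a single numeric column.
y = df['Global_Sales']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Regressor, not classifier, because the target is a continuous quantity.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(prediction)
```

Name and Rank are left out here on purpose: Name is close to unique per row, and Rank is derived from the sales figures themselves.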
CodePudding user response:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv('vgsales.csv') X = df.drop(columns=['NA_Sales' , 'EU_Sales' , 'JP_Sales' , 'Other_Sales' , 'Global_Sales'])
y = df.drop(columns= ['Rank' , 'Year' , 'Platform' , 'Year' , 'Genre' , 'Publisher' ])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
gives
df = pd.read_csv('vgsales.csv') X = df.drop(columns=['NA_Sales' , 'EU_Sales' , 'JP_Sales' , 'Other_Sales' , 'Global_Sales'])
^
SyntaxError: invalid syntax
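One solution is to simply put X = df.drop(columns=['NA_Sales' , 'EU_Sales' , 'JP_Sales' , 'Other_Sales' , 'Global_Sales']) on a new line.
Then convert the string columns to numbers. Note that float() only works on strings that represent a number; text columns such as Name or Platform have to be encoded or dropped instead.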
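A small sketch of the difference between a numeric string and a text label. The Platform values are a made-up sample, and pd.factorize is just one possible encoding, not necessarily the best one for this dataset.

```python
import pandas as pd

# float() succeeds only when the string spells a number.
year = float('2006')
print(year)

# A label such as 'Wii' raises the same error the model reports.
try:
    float('Wii')
    failed = False
except ValueError as exc:
    failed = True
    print(exc)

# Made-up sample of a text column (assumption).
platform = pd.Series(['Wii', 'NES', 'Wii', 'DS'])

# One option: map each distinct label to an integer code.
codes, labels = pd.factorize(platform)
print(codes.tolist())
print(labels.tolist())
```

Integer codes impose an arbitrary ordering on the labels; one-hot encoding (for example with pd.get_dummies) avoids that at the cost of more columns.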
