ValueError: could not convert string to float: 'Virus'

How do I fix the problem below? The data is the video game sales dataset available on Kaggle.

import pandas as pd

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

df = pd.read_csv('vgsales.csv') X = df.drop(columns=['NA_Sales' , 'EU_Sales' , 'JP_Sales' , 'Other_Sales' , 'Global_Sales'])

y = df.drop(columns= ['Rank' , 'Year' , 'Platform' , 'Year' , 'Genre' , 'Publisher' ])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

prediction = model.predict(X_test)

CodePudding user response：

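The error states that a string could not be converted to float because it does not represent a number. Here the string is 'Virus', which presumably comes from one of the text columns. You will probably need to validate or encode the data before fitting. Where exactly does the error occur?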
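To see the message in isolation, here is a minimal sketch that reproduces the failed conversion without pandas or scikit-learn. The helper name to_float_or_none and the sample values are made up for illustration.

```python
def to_float_or_none(value):
    """Return float(value), or None if the string is not numeric."""
    try:
        return float(value)
    except ValueError:
        # float('Virus') raises:
        # ValueError: could not convert string to float: 'Virus'
        return None

# Made-up sample values: two numeric strings and one plain text value.
samples = ["2006", "41.49", "Virus"]
converted = [to_float_or_none(s) for s in samples]
print(converted)  # [2006.0, 41.49, None]
```

Any value that comes back as None here is one that a scikit-learn estimator would also fail on when it tries to build a float array from the input.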

CodePudding user response：

The real problem is not the exception itself. You are using ML without explaining what you actually want to predict. DecisionTreeClassifier is a classifier: given input data, the model tries to determine which class that data belongs to.

If I load your data:

>>> X.columns  # the input data (features)
['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher']

>>> y.columns  # the output data (target)
['Name', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']

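X is not prepared for machine learning (it still contains text columns such as Name, Platform, Genre and Publisher), and y does not look like a target (it contains Name plus five continuous sales columns), so the data is unusable as it stands.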

So what do you want to find with this dataset?
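As an illustration only, suppose the goal were to predict Genre from Platform, Year and Publisher. That goal is an assumption, since the question does not state one. The rows below are made up to mimic the vgsales.csv columns and are not real data. A sketch of preparing X and y might then look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up rows that mimic the vgsales.csv columns (not real data).
df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E", "Virus"],
    "Platform": ["Wii", "NES", "Wii", "PS2", "PS2", "PC"],
    "Year": [2006, 1985, 2008, 2004, 2002, 1999],
    "Genre": ["Sports", "Platform", "Racing", "Action", "Action", "Strategy"],
    "Publisher": ["Pub1", "Pub1", "Pub1", "Pub2", "Pub2", "Pub3"],
})

# Assumed target: a single categorical column.
y = df["Genre"]

# Features: drop the free-text Name column and one-hot encode the
# remaining text columns so that every feature is numeric.
X = pd.get_dummies(
    df[["Platform", "Year", "Publisher"]],
    columns=["Platform", "Publisher"],
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
```

One-hot encoding is only one option here. If the target were a continuous value such as Global_Sales, a regressor (for example DecisionTreeRegressor) would be the more natural choice than a classifier.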

CodePudding user response：

import pandas as pd

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

df = pd.read_csv('vgsales.csv') X = df.drop(columns=['NA_Sales' , 'EU_Sales' , 'JP_Sales' , 'Other_Sales' , 'Global_Sales'])

y = df.drop(columns= ['Rank' , 'Year' , 'Platform' , 'Year' , 'Genre' , 'Publisher' ])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

prediction = model.predict(X_test)

gives

df = pd.read_csv('vgsales.csv') X = df.drop(columns=['NA_Sales' , 'EU_Sales' , 'JP_Sales' , 'Other_Sales' , 'Global_Sales'])
                                    ^
SyntaxError: invalid syntax

One solution is to simply put X = df.drop(columns=['NA_Sales' , 'EU_Sales' , 'JP_Sales' , 'Other_Sales' , 'Global_Sales']) on a new line, since Python expects each statement on its own line (or separated by a semicolon).

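Then deal with the string columns. Note that the float() function only works on strings that represent numbers. A value such as 'Virus' cannot be converted, so text columns have to be dropped or encoded as numbers before calling fit.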
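To find out which columns can actually be converted, one option is pandas.to_numeric with errors="coerce". The small frame below uses made-up values and is only a sketch of the idea:

```python
import pandas as pd

# Made-up rows for illustration; every column is stored as text here.
df = pd.DataFrame({
    "Name": ["Game A", "Virus"],
    "Year": ["2006", "1999"],
    "Global_Sales": ["82.74", "0.01"],
})

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising a ValueError.
as_numbers = df.apply(pd.to_numeric, errors="coerce")

# Columns where every value failed to parse are text, not numbers.
text_columns = [col for col in df.columns if as_numbers[col].isna().all()]
print(text_columns)  # ['Name']
```

Columns reported this way need to be dropped or encoded. The others can be converted safely.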