I have a CSV dataset for an ML classifier. It has 2 columns and looks like this:
But this dataset is very dirty, so I decided to open it with Excel, remove "dirty" words, and save it as a new CSV file and train my ML classifier on it.
But after I saved it in Excel (using , separator and also tried , UTF-8), and when trying pd.read_csv on it, it gives me this error:
Error tokenizing data. C error: Expected 3 fields in line 4, saw 5
Then I tried to use sep=';' with read_csv, and it worked, but now all Russian characters are replaced with strange symbols:
Can somebody explain please how to repair "question"-symbols from Russian characters? encoding='UTF-8' gives this error:
'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte
This is what the first file looks like (not modified Excel .csv file):
When I open second file (modified):
CodePudding user response:
Try opening the file with either ptcp154 or kz1048 encodings. They seem to work.




