.CSV wrong characters: changing encoding to UTF-8 in pandas / Python

I'm reading the following .csv file into pandas:

But some characters are wrong, like:

colisÃ£o should be colisão

NÃ£o should be Não

Weiss TÃ¡xi AÃ©reo should be Weiss Táxi Aéreo

I tried to convert them:

import pandas as pd

df = pd.read_csv('./ex/sample.csv', sep=';', encoding='latin-1')
df.to_csv('new_file.csv', encoding='utf-8')

But when I read new_file.csv, the words are still wrong.

How can I convert every character in this file to the correct one? This is some kind of encoding problem, right?

CodePudding user response：

Just don't specify any encoding and you'll be fine (pandas then falls back to its default, UTF-8):

import pandas as pd

df = pd.read_csv('./ex/sample.csv', sep=';')
df.to_csv('new_file.csv')
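
A minimal sketch of why encoding='latin-1' produces those characters, assuming the file really is UTF-8 as the symptoms suggest. The helper names garble and repair are made up for illustration; only the standard str.encode / bytes.decode calls are used:

```python
# Words as they should appear, taken from the examples in the question.
expected = ["colisão", "Não", "Weiss Táxi Aéreo"]


def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo the mistake: recover the original bytes, then decode as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


for word in expected:
    broken = garble(word)
    print(f"{word!r} -> {broken!r} -> {repair(broken)!r}")
```

Each accented letter takes two bytes in UTF-8, and Latin-1 shows each of those bytes as its own character, which is where pairs like Ã£ come from. Writing the DataFrame back out as UTF-8 afterwards just saves the already garbled text.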

CodePudding user response：

You have a corrupted file. It is actually encoded in UTF-8, but after one bad record (below) all the remaining rows are just ;;;;;;;;;;;;;;;;;;;;;;;;;;. Note that the encoding_errors='replace' parameter requires pandas 1.3.0 or later:

>>> import pandas as pd
>>> df = pd.read_csv('sample.csv', encoding='utf8', sep=';', encoding_errors='replace')
sys:1: DtypeWarning: Columns (1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,39,
41,43,44,45,54,55,57) have mixed types.Specify dtype option on import or set low_memory=False.
>>> df
       codigo_reporte   status classificacao_ocorrencia  ... numeroMotor numeroHelice outraAnvParte
0               251.0  Aprovar                Incidente  ...         NaN          NaN           NaN
1               675.0  Aprovar                Incidente  ...           1          NaN           NaN
2               247.0  Aprovar                Incidente  ...         NaN          NaN           NaN
3               248.0  Aprovar                Incidente  ...         NaN          NaN           NaN
4               249.0  Aprovar                Incidente  ...         NaN          NaN           NaN
...               ...      ...                      ...  ...         ...          ...           ...
63775             NaN      NaN                      NaN  ...         NaN          NaN           NaN
63776             NaN      NaN                      NaN  ...         NaN          NaN           NaN
63777             NaN      NaN                      NaN  ...         NaN          NaN           NaN
63778             NaN      NaN                      NaN  ...         NaN          NaN           NaN
63779             NaN      NaN                      NaN  ...         NaN          NaN           NaN

[63780 rows x 58 columns]
>>> df[:105]
     codigo_reporte   status classificacao_ocorrencia  ... numeroMotor numeroHelice   outraAnvParte
0             251.0  Aprovar                Incidente  ...         NaN          NaN             NaN
1             675.0  Aprovar                Incidente  ...           1          NaN             NaN
2             247.0  Aprovar                Incidente  ...         NaN          NaN             NaN
3             248.0  Aprovar                Incidente  ...         NaN          NaN             NaN
4             249.0  Aprovar                Incidente  ...         NaN          NaN             NaN
..              ...      ...                      ...  ...         ...          ...             ...
100           924.0  Aprovar                Incidente  ...           1          NaN             NaN
101          2923.0  Aprovar                Incidente  ...         NaN          NaN  N�?O INFORMADO
102             NaN      NaN                      NaN  ...         NaN          NaN             NaN
103             NaN      NaN                      NaN  ...         NaN          NaN             NaN
104             NaN      NaN                      NaN  ...         NaN          NaN             NaN

[105 rows x 58 columns]

Note the bad data in the last column at index 101. That's the source of the UnicodeDecodeError mentioned in the comments. After that row all the columns are NaN, because the rest of the original file is just empty fields separated by semicolons.
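
If you just want a usable file, one option is to read it as UTF-8 with replacement and drop the rows that are entirely empty. This is only a sketch: the bytes in raw are a small made-up stand-in for sample.csv (three columns instead of 58, one deliberately invalid byte), not the real data:

```python
import io

import pandas as pd

# Hypothetical stand-in for sample.csv: two good rows, one row holding a
# byte sequence that is not valid UTF-8, then records with only separators.
raw = (
    "codigo_reporte;status;outraAnvParte\n"
    "251;Aprovar;colisão\n"
    "675;Aprovar;Não\n"
).encode("utf-8")
raw += b"2923;Aprovar;N\xc3O INFORMADO\n"  # \xc3 lacks its continuation byte
raw += b";;\n;;\n"

# encoding_errors='replace' (pandas >= 1.3.0) turns undecodable bytes into
# U+FFFD instead of raising UnicodeDecodeError.
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="utf-8",
                 encoding_errors="replace")

# Drop the rows where every column is NaN (the ';;;...' records).
clean = df.dropna(how="all")

# With no path, to_csv returns the CSV as a string; pass a file name and
# encoding='utf-8' to write it to disk instead.
csv_text = clean.to_csv(sep=";", index=False)
print(clean)
```

The replaced character in the bad record is lost either way, so that one value still has to be fixed by hand or taken from the original source.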