When reading a CSV file with pandas.read_csv, using the converters keyword appears to disable some other preprocessing arguments (tested with decimal and na_values). Example:
import pandas as pd
from io import StringIO
txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"
buffer = StringIO(txt)
converters = {i: lambda x: x for i in range(3)} # dummy converter
df1 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"])
buffer.seek(0)
df2 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"], converters=converters)
print(df1)
print(df2)
A B C
0 12.5 4.0 NaN
1 3.1 7.0 5.6
2 8.0 NaN 7.0
A B C
0 12,5 4 xxx
1 3,1 7 5,6
2 8 n/a 7
df1 is imported correctly, while df2 holds unconverted strings (dtype: object). Apparently the arguments decimal and na_values are ignored as soon as a converter is defined explicitly for a column. Even the built-in default NA conversion of the string "n/a" fails (keep_default_na=True).
Question: Is there a way to use converters together with the other preprocessing arguments? Or do I have to extend my application-specific converters manually so that they also handle the decimal sign and NA values?
CodePudding user response:
No, it's not possible. The documentation does not say so explicitly for the sep, decimal and thousands parameters; it only states it for dtype:
If converters are specified, they will be applied INSTEAD of dtype conversion.
Consider these parameters as used only by the default converter of read_csv.
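If you would rather keep everything inside converters, each converter has to handle the NA markers and the decimal sign itself. Here is a minimal sketch of that idea, using the data from the question. The names with_preprocessing, NA_STRINGS and double are made up for illustration and are not part of pandas:

```python
import numpy as np
import pandas as pd
from io import StringIO

# Assumption: NA markers and decimal sign are the ones used in the question.
NA_STRINGS = {"", "xxx", "n/a"}

def with_preprocessing(func, decimal=",", na_strings=NA_STRINGS):
    """Wrap a per-value converter so that it deals with NA markers
    and the decimal sign before calling the wrapped function."""
    def convert(raw):
        raw = raw.strip()
        if raw in na_strings:
            return np.nan  # NA markers never reach func
        return func(float(raw.replace(decimal, ".")))
    return convert

def double(value):
    # Hypothetical application-specific conversion.
    return value * 2

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"
converters = {name: with_preprocessing(double) for name in ["A", "B", "C"]}

# decimal and na_values are deliberately not passed: the wrapper replaces them.
df = pd.read_csv(StringIO(txt), sep=" ", converters=converters)
print(df)
```

The wrapper keeps the application-specific function free of parsing concerns, at the cost of repeating logic that read_csv otherwise provides.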
A possible solution is to let read_csv parse your file first, then use assign to modify the values of each column.
For instance:
# convert_col1 and convert_col2 are placeholders for your own per-value functions
df = (pd.read_csv('input.csv', sep=';', decimal=',', thousands='_')
        .assign(col1=lambda x: x['col1'].map(convert_col1),
                col2=lambda x: x['col2'].map(convert_col2)))
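Applied to the data from the question, the parse-then-assign approach could look like the sketch below. The function double is only a stand-in for an application-specific conversion; because it runs after parsing, it receives floats and NaN rather than raw strings:

```python
import pandas as pd
from io import StringIO

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"

def double(value):
    # Hypothetical application-specific conversion; NaN * 2 stays NaN.
    return value * 2

# Step 1: let read_csv handle the decimal sign and the NA markers.
parsed = pd.read_csv(StringIO(txt), sep=" ", decimal=",", na_values=["xxx"])

# Step 2: post-process the already-parsed columns.
result = parsed.assign(A=lambda x: x["A"].map(double),
                       C=lambda x: x["C"].map(double))
print(result)
```

Column B is left untouched here to show that only the columns named in assign are changed.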
CodePudding user response:
I didn't look into why in detail, but this seems to depend on the engine used. Specifying engine='python' makes it work:
import pandas as pd
from io import StringIO
txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 4 7"
buffer = StringIO(txt)
converters = {i: lambda x: x for i in range(3)} # dummy converter
df1 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"])
buffer.seek(0)
df2 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"], converters=converters, engine='python')
print(df1)
print(df2)
output:
A B C
0 12.5 4 NaN
1 3.1 7 5.6
2 8.0 4 7.0
A B C
0 12.5 4 NaN
1 3.1 7 5.6
2 8 4 7
