I'm trying to parse a CSV file containing German text, i.e., it is not comma-separated but semicolon-separated, and it may contain umlauts (ä, ö, ü, etc.).
Using Cassava and following the linked tutorial, for a column with a header including an Umlaut, I'm getting the error:
parse error (Failed reading: conversion error: no field named "W\228hrung") at "\nEUR;0,99"
Where the minimal CSV file causing the error is:
Währung;Betrag
EUR;14,12
EUR;0,99
Data type and FromNamedRecord instance:
data Transaction = Tx
  { waehrung :: Text
  , betrag :: Betrag
  }

instance FromNamedRecord Transaction where
  parseNamedRecord m =
    Tx
      <$> m .: "Währung"
      <*> m .: "Betrag"
The CSV is encoded as UTF-8 and I'm calling setLocaleEncoding utf8 in main.
Like the tutorial, I'm using the OverloadedStrings extension, so "Währung" is a ByteString.
Versions: GHC 8.10.7, cassava ^>=0.5.2.0
The full MRE code with Cabal file and CSV can be found in this Gist.
CodePudding user response:
You need to write:
-- assumes: import qualified Data.Text.Encoding as Text
instance FromNamedRecord Transaction where
  parseNamedRecord m =
    Tx
      <$> m .: Text.encodeUtf8 "Währung"
      <*> m .: "Betrag"
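The problem is that cassava internally represents field names as ByteStrings holding the UTF-8 encoding of the text. However, the IsString instance for ByteString, which OverloadedStrings uses to turn a string literal into a ByteString, does not encode as UTF-8; it truncates each character to the least-significant byte of its code point (which is basically never what you want for non-ASCII strings). For "Währung", the ä (U+00E4) becomes the single byte 228 instead of the UTF-8 bytes 195 164, which is why the error message mentions "W\228hrung" and why the lookup against the UTF-8 encoded header fails.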
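To see the difference concretely, here is a minimal sketch that prints the bytes produced by each route. The names truncatedHeader and utf8Header are made up for illustration, and it assumes the source file is saved as UTF-8 with a precomposed ä (U+00E4), as the \228 in the error message suggests.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Data.ByteString    (ByteString)
import qualified Data.ByteString    as BS
import           Data.Text          (Text)
import qualified Data.Text.Encoding as TE

-- Literal converted by ByteString's IsString instance:
-- each character is cut down to the low 8 bits of its code point.
truncatedHeader :: ByteString
truncatedHeader = "Währung"

-- Literal read as Text first, then explicitly encoded as UTF-8.
utf8Header :: ByteString
utf8Header = TE.encodeUtf8 ("Währung" :: Text)

main :: IO ()
main = do
  print (BS.unpack truncatedHeader)      -- [87,228,104,114,117,110,103]
  print (BS.unpack utf8Header)           -- [87,195,164,104,114,117,110,103]
  print (truncatedHeader == utf8Header)  -- False
```

The second list is what the header of a UTF-8 encoded CSV file actually contains, so only utf8Header can match it.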

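For completeness, a self-contained sketch of the whole round trip with the fix applied might look like the following. It is not the code from the Gist: the Betrag type from the question is not shown, so betrag is simply read as Text here, the CSV is built in memory instead of read from a file, and semicolonOptions, sampleCsv and parseSample are hypothetical names.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Lazy as BL
import           Data.Char            (ord)
import           Data.Csv             (DecodeOptions (..), FromNamedRecord (..),
                                       Header, decodeByNameWith,
                                       defaultDecodeOptions, (.:))
import           Data.Text            (Text)
import qualified Data.Text.Encoding   as TE
import qualified Data.Vector          as V

-- Assumption: betrag is kept as Text, since the question's Betrag type
-- and its FromField instance are not shown.
data Transaction = Tx
  { waehrung :: Text
  , betrag   :: Text
  } deriving (Show)

instance FromNamedRecord Transaction where
  parseNamedRecord m =
    Tx
      <$> m .: TE.encodeUtf8 "Währung"  -- UTF-8 bytes, matching the file
      <*> m .: "Betrag"                 -- pure ASCII, so the literal is fine

-- Use ';' instead of ',' as the field delimiter.
semicolonOptions :: DecodeOptions
semicolonOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord ';') }

-- The sample file from the question, encoded as UTF-8 in memory.
sampleCsv :: BL.ByteString
sampleCsv = BL.fromStrict (TE.encodeUtf8 "Währung;Betrag\nEUR;14,12\nEUR;0,99\n")

parseSample :: Either String (Header, V.Vector Transaction)
parseSample = decodeByNameWith semicolonOptions sampleCsv

main :: IO ()
main =
  case parseSample of
    Left err        -> putStrLn ("parse error: " ++ err)
    Right (_, rows) -> V.mapM_ print rows
```

If the project has many non-ASCII headers, wrapping TE.encodeUtf8 in a small helper keeps the instance readable.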