Named field with Umlaut not recognized with Cassava

Time:01-11

I'm trying to parse a CSV file containing German text: the fields are separated by semicolons rather than commas, and the text may contain umlauts (ä, ö, ü, etc.).

I'm using Cassava and following the linked tutorial. For a column whose header contains an umlaut, I get this error:

parse error (Failed reading: conversion error: no field named "W\228hrung") at "\nEUR;0,99"

Where the minimal CSV file causing the error is:

Währung;Betrag
EUR;14,12
EUR;0,99

Data type and FromNamedRecord instance:

data Transaction = Tx
  { waehrung :: Text
  , betrag :: Betrag
  }

instance FromNamedRecord Transaction where
  parseNamedRecord m =
    Tx
      <$> m .: "Währung"
      <*> m .: "Betrag"

The CSV is encoded as UTF-8 and I'm calling setLocaleEncoding utf8 in main. Like the tutorial, I'm using the OverloadedStrings extension, so "Währung" is a ByteString.

Versions: GHC 8.10.7, cassava ^>=0.5.2.0

The full MRE code, with Cabal file and CSV, can be found in this Gist.

CodePudding user response:

You need to write:

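-- Requires: import qualified Data.Text.Encoding as Text
instance FromNamedRecord Transaction where
  parseNamedRecord m =
    Tx
      -- The literal is read as Text here, then explicitly encoded as UTF-8.
      <$> m .: Text.encodeUtf8 "Währung"
      <*> m .: "Betrag"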

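The problem is that cassava internally represents field names as ByteStrings holding the UTF-8 encoding of the text. However, the IsString instance for ByteString, which turns a string literal into a ByteString under OverloadedStrings, does not use UTF-8: it keeps only the least-significant byte of each character's code point (which is basically never what you want for non-ASCII strings). The literal "Währung" therefore contains the single byte 228 (0xE4) for ä, which is the "W\228hrung" shown in the error message, whereas the UTF-8 header in the CSV file contains the two bytes 0xC3 0xA4 at that position.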
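The difference can be checked without cassava at all. The following is a minimal sketch, assuming only the bytestring and text packages that ship with GHC; the names literalBytes and utf8Bytes are made up for this illustration and do not come from the question's code.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical name: the literal goes through the IsString instance for
-- ByteString, which truncates every Char to its lowest 8 bits.
literalBytes :: BS.ByteString
literalBytes = "Währung"

-- Hypothetical name: the same text, explicitly encoded as UTF-8. These are
-- the bytes a UTF-8 encoded CSV header contains.
utf8Bytes :: BS.ByteString
utf8Bytes = TE.encodeUtf8 (T.pack "Währung")

main :: IO ()
main = do
  print (BS.unpack literalBytes)     -- [87,228,104,114,117,110,103]
  print (BS.unpack utf8Bytes)        -- [87,195,164,104,114,117,110,103]
  print (literalBytes == utf8Bytes)  -- False
```

Because the two ByteStrings differ, a lookup keyed on the literal cannot match the header read from the file, which is consistent with the "no field named" error above.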
