Parsing Malformed CSV

Time:02-07

I am trying to use regex to parse a malformed CSV file.

The fields are:

  • double-quoted only if there is a comma in the value, and
  • any double quote inside such a quoted value is then escaped by doubling it.
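
Read literally, these two rules describe a writer roughly like the one below. This is only a sketch of an assumed producer: `EncodeField` is a hypothetical name, and nothing here comes from the program that actually created the file. It is meant to show why a value such as `"hello" world` ends up unquoted and unescaped.

```csharp
using System;

// Hypothetical writer that follows the two rules above (an assumption about
// how the file was produced, not code from the original source)
static string EncodeField(string value)
{
    // No comma: the value is written as-is, even if it contains quotes
    if (!value.Contains(","))
    {
        return value;
    }

    // Comma present: double every quote, then wrap the value in quotes
    return "\"" + value.Replace("\"", "\"\"") + "\"";
}

Console.WriteLine(EncodeField("\"hello\" world"));   // "hello" world
Console.WriteLine(EncodeField("hello, \"peter\""));  // "hello, ""peter"""
```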

Given:

Greeting,"hello" world,"hello, world","hello, ""peter""",19.99

...should output:

Greeting, "hello" world, hello, world, hello, "peter", 19.99

The regex I tried:

(?:^|,)(". "|[^,]*)(?=,|$)

I am not sure how to match commas and escaped double quotes inside double-quoted values.

CodePudding user response:

This was a lot of work.

The easiest way to achieve this is to keep all quotes as they appear in the source text and then sanitize them after parsing the fields.

Regex should not be used to parse CSV if possible.

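using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

var csv = "Greeting,\"hello\" world,\"hello, world\",\"hello, \"\"peter\"\"\",19.99";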

// parse CSV while keeping all quotes
var csvLine = CsvReaderKeepQuotes.ParseLine(new StringReader(csv));

// sanitize fields with quotes
csvLine = csvLine.Select(x =>
{
    // if field was quoted completely
    if (x.Length >= 2 && x[0] == '"' && x[x.Length -1] == '"')
    {
        // remove outer quotes
        var tmp = x.Substring(1, x.Length - 2);

        // replace double quotes
        tmp = tmp.Replace("\"\"", "\"");

        return tmp;
    }

    // field was not quoted
    return x;

}).ToList();

public class CsvReaderKeepQuotes
{

    public static List<string> ParseLine(StringReader r)
    {
        int ch = r.Read();
        while (ch == '\r') {
            // skip carriage return (CR) chars wherever they appear, particularly just before end of file
            ch = r.Read();
        }
        if (ch < 0) {
            return new List<string>();
        }

        var store = new List<string>();
        var curValSb = new StringBuilder();
        bool inquotes = false;
    
        while (ch >= 0)
        {
            if (inquotes)
            {
                if (ch == '\"')
                {
                    inquotes = false;
                }

                curValSb.Append((char)ch);
            }
            else
            {
                if (ch == '\"')
                {
                    inquotes = true;
                    curValSb.Append('\"');
                }
                else if (ch == ',')
                {
                    store.Add(curValSb.ToString());
                    curValSb.Clear();
          
                }
                else if (ch == '\r')
                {
                    // ignore carriage return (CR) characters
                }
                else if (ch == '\n')
                {
                    //end of a line, break out
                    break;
                }
                else
                {
                    curValSb.Append((char)ch);
                }
            }
            ch = r.Read();
        }
        store.Add(curValSb.ToString());
        return store;
    }

}

EDIT:

Sample Output

Debug.WriteLine(String.Join("\n",  csvLine));

Greeting
"hello" world
hello, world
hello, "peter"
19.99

CodePudding user response:

I think what you need to do is a combination of string search and regex.

Search for a double quote, then for the next double quote (the closing quote). Use a regex to check whether there are letters in between: if there are, extract that as a word; otherwise keep searching for a closing double quote, then extract that as a string.

Then apply a regex to what remains. It might not be a pretty solution, but it should work.
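
The regex part of this idea could be sketched as a single pattern with two alternatives: first try a fully quoted field, then fall back to a plain run of non-comma characters. This is a pattern-only variant rather than the exact search-then-regex procedure described above. It assumes every quoted field is closed and is followed directly by a comma or the end of the line; the names `pattern` and `fields` are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var line = "Greeting,\"hello\" world,\"hello, world\",\"hello, \"\"peter\"\"\",19.99";

// Lookbehind: a field starts at the beginning of the line or right after a comma.
// Alternative 1: a quoted field, where "" inside stands for one literal quote.
// Alternative 2: anything up to the next comma (this covers "hello" world).
// Lookahead: a field must end at a comma or at the end of the line.
var pattern = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)(?=,|$)");

var fields = new List<string>();
foreach (Match m in pattern.Matches(line))
{
    var value = m.Groups[1].Value;

    // A value wrapped in quotes is unwrapped, and doubled quotes are unescaped
    if (value.Length >= 2 && value[0] == '"' && value[value.Length - 1] == '"')
    {
        value = value.Substring(1, value.Length - 2).Replace("\"\"", "\"");
    }

    fields.Add(value);
}

Console.WriteLine(string.Join("\n", fields));
```

For the sample line this should produce the five fields the question asks for. An unquoted value that happens to both start and end with a double quote would still be unwrapped, so the sketch inherits that ambiguity from the malformed format.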
