Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign. The strings are quoted, but sometimes there is a newline inside a quoted string.

I need to read these files through an external table in Oracle, but it raises errors on those newlines. So I need to replace them with a space.

I already run some other Perl commands on these files to fix other errors, so I would like a solution in the form of a Perl one-liner.

I've found some similar questions on Stack Overflow, but they don't quite cover the same case, and I couldn't solve my problem with the solutions mentioned there.

The statement I tried, which isn't working:

perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt

Sample text:

4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....

Should become:

4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....

CodePudding user response：

Sounds like you want a CSV parser like Text::CSV_XS (install it through your OS's package manager or your favorite CPAN client):

$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
  $csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ]) 
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "

This one-liner reads each record using | as the field separator instead of the usual comma (binary => 1 lets fields contain embedded newlines), replaces the newlines in every field with spaces, and then prints out the transformed record.

CodePudding user response：

In your specific case, you can also consider a workaround using GNU sed or awk.

An awk command will look like

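awk 'NR==1 {print; next} /^[0-9]{4,}\|/ {print "\n" $0; next} {print " " $0} END {print "\n"}' ORS="" file > newfile

The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed by a | char (matched with the ^[0-9]{4,}\| POSIX ERE pattern). Any other line is a continuation of the previous record and is appended after a single space. The END block adds the final line break.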

A GNU sed command will look like

sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file

This reads consecutive lines into the pattern space, and as long as the newly appended line doesn't start with four or more digits followed by a | char (see the \n[0-9]\{4,\}| POSIX BRE regex pattern), the line break between the two is replaced with a space. The search and replace repeats until the next line starts a new record or the end of the file is reached.
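For readers who find the one-liner hard to follow, here is the same sed script spread over several lines with comments. It is wrapped in a shell function (join_records is a made-up name) that reads stdin instead of editing in place, and it assumes GNU sed, like the command above.

```shell
# Hypothetical wrapper around the sed script; reads stdin, writes stdout.
join_records() {
  sed '
    :a
    # Unless this is the last input line...
    $!{
      # ...append the next input line to the pattern space.
      N
      # If the appended line does not start a new record (4+ digits and a pipe),
      # turn the embedded line break into a space and loop again.
      /\n[0-9]\{4,\}|/!{
        s/\n/ /
        ba
      }
    }
    # Print the finished record (up to the first line break), drop it,
    # and restart the script with what is left in the pattern space.
    P
    D
  '
}

printf '4455|"test newline\nin string"||"x"\n4456|"y"\n' | join_records
# 4455|"test newline in string"||"x"
# 4456|"y"
```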

With perl, if the file is huge but can still fit into memory, you can use a short one-liner:

perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file

With -0777, you slurp the whole file, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) that are not followed by four or more digits and a | char. The possessive quantifier is required so that the (?!...) negative lookahead cannot succeed by backtracking into the line break sequence.