Perl multiline regex in windows

Time:01-28

I'm stuck with this scenario, I have this regex

*Input added here for clarity:

181221533;MG;3;1476729;<vars>  <vint>    <name>mtest</name> <storedPrecedure>f_sc_mtest</SP>    <base>M_data</base>    <dataType>I</dataType>    <timeMS>17</timeMS>    <ttidr>abc</ttidr>  <base>S</base>    <valor>0</valor>  </vint>  </vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;MG;6314429;740484;<vars>  <vint>    <name>mtest</name>    <sP>f_sc_mtest</sP> <base>sscy</base>    <dataType>I</dataType>    <timeMS>16</timeMS>    <ttidr>abc</Idtype>    <base>S</base>    <valor>4</valor>  </vint></vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;ModeloSP;6314429;740484;<vars>  <vint>     <name>tc_p_act</name>    <sP>rndom_name</sP>    <base>sscyo</base>    <dataType>I</dataType>    <timeMS>0</timeMS>    <Idtype>XYZ</Idtype>    <base>O</base>  </vint>
</vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22

182652988;ModeloSP;6314429;740484;<vars>  <vint>     <name>tc_p_act</name>    <sP>rndom_name</sP>    <base>sscyo</base>    <dataType>I</dataType>    <timeProcess>1</timeProcess>    <Idtype>XYZ</Idtype>    <base>O</base>  </vint>
</vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

I want to implement this regex in Perl with multiline support because, as you can see in the sample, some records contain line breaks. The regex searches for 'incomplete' lines (and the extra blank line) and fixes them; each record should occupy one line and end with a datetime.

this is what I'm attempting with perl:

perl.exe -0777 -i -pe "s/(?m)^(.*)(>)([\n]+)(<)(.*)([\n]+)(\s*)$/$1$2    $4$5/igs" "sample.txt"

It doesn't seem to work; the text file comes out unchanged. I'm using the Perl bundled with a portable Git installation (v5.34.0).

Is there something I'm missing?

Edit: this is how the output should look:

181221533;MG;3;1476729;<vars>  <vint>    <name>mtest</name> <storedPrecedure>f_sc_mtest</SP>    <base>M_data</base>    <dataType>I</dataType>    <timeMS>17</timeMS>    <ttidr>abc</ttidr>  <base>S</base>    <valor>0</valor>  </vint>  </vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;MG;6314429;740484;<vars>  <vint>    <name>mtest</name>    <sP>f_sc_mtest</sP> <base>sscy</base>    <dataType>I</dataType>    <timeMS>16</timeMS>    <ttidr>abc</Idtype>    <base>S</base>    <valor>4</valor>  </vint></vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;ModeloSP;6314429;740484;<vars>  <vint>     <name>tc_p_act</name>    <sP>rndom_name</sP>    <base>sscyo</base>    <dataType>I</dataType>    <timeMS>0</timeMS>    <Idtype>XYZ</Idtype>    <base>O</base>  </vint>    </vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22
182652988;ModeloSP;6314429;740484;<vars>  <vint>     <name>tc_p_act</name>    <sP>rndom_name</sP>    <base>sscyo</base>    <dataType>I</dataType>    <timeProcess>1</timeProcess>    <Idtype>XYZ</Idtype>    <base>O</base>  </vint>    </vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

CodePudding user response:

This seems to produce the wanted output:

perl.exe -0777 -pe "s: *\n(?=</):    :g;s/\n+/\n/g"
  • The first substitution replaces any spaces plus a newline that come right before </ with four spaces.
  • The second substitution replaces runs of newlines with a single one. You can also replace it with a transliteration: tr/\n//s; the /s modifier "squeezes" repeated newlines.
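The two steps above can be sketched as a small script, which is easier to test than a one-liner. The helper name join_broken_records and the shortened sample are assumptions made for illustration; the substitutions are the ones from the one-liner, with the transliteration used for the second step. It assumes the text contains plain \n line endings.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): applies both steps to a string.
sub join_broken_records {
    my ($text) = @_;
    $text =~ s/ *\n(?=<\/)/    /g;   # spaces + newline before a closing tag -> four spaces
    $text =~ tr/\n//s;               # squeeze each run of newlines into a single newline
    return $text;
}

# Shortened, made-up sample with the same shape as the input in the question.
my $sample = "1;<vars> <vint>a</vint>\n</vars>;0;22:31:22\n\n"
           . "2;<vars> <vint>b</vint>\n</vars>;0;22:37:36\n";

print join_broken_records($sample);
```

Each record should come out on a single line, with four spaces where the stray line break used to be.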

CodePudding user response:

If the issue is having newlines in the wrong place, either multiple newlines in a row, or before a <, you may get away with something simple like this:

use strict;
use warnings;

my $str = do { local $/; <DATA> };

$str =~ s/\n(?=[<\n])//g;
print $str;

__DATA__
181221533;<valor>0</valor></vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;</vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;</vint>
</vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22

182652988; </vint>
</vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

(I shortened the input to make it readable)

Output:

181221533;<valor>0</valor></vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;</vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;</vint></vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22
182652988; </vint></vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

CodePudding user response:

Capture the whole record and replace all newlines in it with a space, using another regex inside the replacement part (courtesy of the /e modifier). Then replace every run of newlines with a single one.

perl.exe -0777 -wpe'
    s{ (?:^|\R)\K (\d{9}; .*? \s+ \d\d:\d\d:\d\d) }{$1 =~ s/\n+/ /gr}segx; s{\n+}{\n}g
' file.txt

I consider a "record" to be: [0-9]{9}; on line/file beginning, then all up to and including a timestamp after spaces. The details for beginning and end of record should protect against accidental matching of possible unexpected patterns inside those tags.

This is cumbersome but it captures the record correctly I hope, even if some details change.
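For readability, the same idea can be written out as a commented script. This is only a sketch: the helper name normalize_records and the made-up sample are assumptions, the inner substitution uses /g so that every line break inside a record is replaced, and the input is assumed to use plain \n line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: joins each record onto one line, then drops blank lines.
sub normalize_records {
    my ($text) = @_;
    $text =~ s{
        (?:^|\R)\K                          # start of text or just after a line break
        ( \d{9}; .*? \s+ \d\d:\d\d:\d\d )   # nine digits, then everything up to the timestamp
    }{ $1 =~ s/\n+/ /gr }segx;              # /e runs the inner substitution on the record
    $text =~ s/\n+/\n/g;                    # collapse runs of newlines between records
    return $text;
}

my $sample = "123456789;<a>x</a>\n</b>;1;01/01/2019 22:31:22\n\n"
           . "987654321;<a>y</a>\n</b>;2;01/01/2019 22:37:36\n";

print normalize_records($sample);
```

The /r flag on the inner substitution returns the modified copy instead of trying to change $1, which is read-only; it needs Perl 5.14 or later.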


Apparently the above fails on Windows as it stands, while it is confirmed to work on Linux (the only system I can try it on right now).

The issue must be in newlines -- so try replacing \n in matches with \R or \r\n. In particular in the regex embedded in the replacement part. Or, to be safe and perhaps portable, replace \n with (\r?\n) (so the carriage return character is optional, need not be there)

So either

s{ (?:^|\R)\K (\d{9}; .*? \s+ \d\d:\d\d:\d\d) }{$1 =~ s/\R+/ /gr}segx; s{\R+}{\r\n}g

or

s{ (?:^|\R)\K(\d{9};.*?\s+\d\d:\d\d:\d\d) }{$1 =~ s/(\r\n)+/ /gr}segx; s{(\r\n)+}{\r\n}g

But \R should match it on Windows, so you should be able to use \R for matching and \r\n when needed in replacements. See it under Misc in perlbackslash
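A quick way to see the difference, assuming the file really reaches the regex with its carriage returns intact (the variable names here are made up for the demonstration):

```perl
use strict;
use warnings;

# Two CRLF line breaks in a row, as a blank line would appear in a raw Windows file.
my $crlf = "a\r\n\r\nb\r\n";

# \n+ never sees a run: every \n is separated from the next by a \r.
(my $with_n = $crlf) =~ s/\n+/\n/g;

# \R treats \r\n as one line break, so \R+ matches the whole run.
(my $with_R = $crlf) =~ s/\R+/\r\n/g;

print $with_n eq $crlf ? "\\n+ changed nothing\n" : "\\n+ changed the text\n";
print $with_R eq "a\r\nb\r\n" ? "\\R+ removed the blank line\n" : "\\R+ did not help\n";
```

This would explain why a \n-based pattern leaves the file unchanged on such input, though it does not prove that this is what happens in the asker's setup.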


Better yet, if it works, is to use PerlIO layers. Normally a Windows build of Perl adds the :crlf layer by default, but apparently that's not the case here.

In a one-liner try:

perl.exe -0777 -Mopen=:std,IO,:crlf -wpe'...'

Or use the "one-liner" as a normal program, without the file-processing switches: set the layer via the open pragma and read the file manually

perl -wE'use open IO => ":crlf"; $_ = do { local $/; <> }; s{...}{...}; say' file

With layers set like this (either way) use the regex with \n.
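As a self-contained sketch of what the layer does, the script below writes a small CRLF sample to a temporary file and reads it back with the layer given explicitly in the open call, which is a variation on the pragma-based commands above. The sample data and variable names are assumptions; the regex is the \n-based one from the second answer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a made-up CRLF sample in raw mode so the bytes land on disk exactly as given.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} "1;<a>x</a>\r\n</b>;22:31:22\r\n\r\n2;<a>y</a>\r\n</b>;22:37:36\r\n";
close $tmp_fh or die "close: $!";

# Read it back through the :crlf layer: each CRLF pair arrives as a plain \n.
open my $in, '<:crlf', $tmp_name or die "open: $!";
my $text = do { local $/; <$in> };   # slurp the whole file
close $in;

$text =~ s/\n(?=[<\n])//g;           # drop newlines that precede '<' or another newline
print $text;
```

Note that the output is printed with whatever layers STDOUT has, so on a build without the default :crlf layer the result would contain bare \n line endings unless the layer is applied to the output as well.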
