Extract string in Perl up to first occurrence of specific string

Time:01-27

I'm trying to write a Perl one-liner that outputs a substring from a string piped to it. The following works perfectly:

$ echo rubbishdatarubbish | perl -ne 'print  $_ =~ /rubbish(.*)rubbish/'
data

However, it breaks when there are more occurrences of the ending string:

$ echo rubbishdatarubbishrubbish | perl -ne 'print  $_ =~ /rubbish(.*)rubbish/'
datarubbish

I tried adding the ? 'non-greedy' modifier (both before and after the ending string), but that made no difference. I appear to be using a recent version of Perl, so I guess that can't be it:

$ perl -v
This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-msys-thread-multi

What am I missing? Something obvious I'm sure...

CodePudding user response:

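To get the string between the first 'rubbish' and the second 'rubbish', make the quantifier itself non-greedy by writing '.*?'. The '?' has to come directly after the '*'. The example below uses it twice, although the leading '^.*?' is optional, because the regex engine already starts at the leftmost possible match.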

echo rubbishdatarubbishrubbish | perl -ne 'print  $_ =~ /^.*?rubbish(.*?)rubbish/'

returns 'data'

CodePudding user response:

If I understand correctly, your data is surrounded by a fixed string that repeats multiple times, and you want to use a regex to extract the data. While this is certainly possible with enough regex skill, it may not be the best method. Consider, for example:

$ echo rubbishdatarubbishrubbish | perl -ple's/rubbish//g'
data

Or

$ echo rubbishdatarubbishdatarubbishdata | perl -F/rubbish/ -lanwe'print for @F'
data
data
data

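Both of these just remove rubbish from the string. The latter uses split, which makes it easier to separate the individual pieces of data. Note that because this input begins with rubbish, split also returns an empty leading field, which prints as a blank first line. What you see above is autosplit mode: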

-a                autosplit mode with -n or -p (splits $_ into @F)
-F/pattern/       split() pattern for -a switch (//'s are optional)
-l[octal]         enable line ending processing, specifies line terminator

Basically, that one-liner does the same as:

perl -ne'chomp; @F = split /rubbish/, $_; print $_, $/ for @F;'

$/ is the input record separator, normally a newline.

The benefit of these methods is that you do not need balanced pairs of rubbish to encapsulate your data.
