AWK regex split function using multiple delimiters

I'm trying to use Awk's split function to split input into three fields so that I can use the values as field[1], field[2], field[3]. The first field should be everything up to and including the colon, the second everything up to the first tab (\t) (the hex bytes), and the third everything else.

I've tried multiple regexes and the closest I've come to solving this is:

echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" \
| awk '{split($0,field,/([:])([ ])|([\t])/); \
print "length of field:" length(field);for (x in field) print field[x]}'

But the result doesn't include the colon, and I'm not sure whether the regex I've written is any good:

length of field:3
ffffffff81000000
48 8d 25 51 3f 60 01
leaq asdf asdf asdf

Thanks in advance.

CodePudding user response：

Using gnu-awk's RS (for record separator) variable:

s=$'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf'
awk -v RS='^\\S+|[^\t:]+' '{gsub(/^\s*|\s*$/, "", RT); print RT}' <<< "$s"

ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf

Explanation:

RS='^\\S+|[^\t:]+': sets RS to 1+ non-whitespace characters at the start OR 1+ non-tab, non-colon characters
gsub(/^\s*|\s*$/, "", RT): removes whitespace at the start or end of the RT variable, which gets populated with the text matched by RS
print RT: prints the RT variable

If you want to print length of fields also then use:

awk -v RS='^\\S+|[^\t:]+' '{gsub(/^\s*|\s*$/, "", RT); print RT} END {print "length of field:", NR}' <<< "$s"

ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
length of field: 3

If you don't have gnu-awk then here is a POSIX awk solution for the same:

awk '{
   while (match($0, /^[^[:blank:]]+|[^\t:]+/)) {
      print substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + RLENGTH)
   }
}' <<< "$s"

ffffffff81000000:
 48 8d 25 51 3f 60 01
leaq asdf asdf asdf
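
Note that the POSIX loop prints the second piece with its leading space. A minimal variation that trims each piece before printing is sketched below. It assumes any POSIX awk and plain sh; the names s, pieces and piece are only illustrative, and [^ \t] is used in place of [^[:blank:]] on the assumption that some older awks lack character classes.

```shell
# Sample line from the question, with a real tab built by printf.
s=$(printf 'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf')

pieces=$(printf '%s\n' "$s" | awk '{
    # Repeatedly peel the next piece off the front of the record.
    while (match($0, /^[^ \t]+|[^\t:]+/)) {
        piece = substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + RLENGTH)
        # Trim spaces and tabs at both ends of the piece.
        gsub(/^[ \t]+|[ \t]+$/, "", piece)
        print piece
    }
}')

printf '%s\n' "$pieces"
```

On the sample line this should give the same three lines as the gnu-awk version, without the leading space.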

CodePudding user response：

Your regex can be simplified as:

split($0,field,/: |\t/)

but the result will be the same, without the colon character, because the text matched by the delimiter pattern is not included in the split result.

If you want a delimiter such as "a space preceded by a colon", where the colon itself is not consumed, you need a lookbehind assertion as found in PCRE, which awk regexes do not support.

Here is an example with python:

#!/usr/bin/python

import re

s = "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf"
print(re.split(r'(?<=:) |\t', s))

Output:

['ffffffff81000000:', '48 8d 25 51 3f 60 01', 'leaq asdf asdf asdf']

You'll see the colon is included in the result.
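
If you would rather stay in awk, the three pieces can also be cut out without any lookbehind, using index() and substr(). This is only a sketch: it assumes the line always contains ": " before the first tab, and the variable names (line, fields, addr, hex, asm) are made up for illustration.

```shell
# Sample line from the question, with a real tab built by printf.
line=$(printf 'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf')

fields=$(printf '%s\n' "$line" | awk '{
    c    = index($0, ": ")         # position of the first colon-space pair
    addr = substr($0, 1, c)        # everything up to and including the colon
    rest = substr($0, c + 2)       # skip the colon and the space
    t    = index(rest, "\t")       # position of the first tab in the remainder
    hex  = substr(rest, 1, t - 1)  # the hex bytes
    asm  = substr(rest, t + 1)     # everything after the tab
    print addr
    print hex
    print asm
}')

printf '%s\n' "$fields"
```

Because nothing here is a regex, the colon stays attached to the first piece simply by choosing where substr() stops.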

CodePudding user response：

Using your awk code with some changes:

echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" | awk -v OFS='\n' '
{
sub(/: */,":\t")
split($0,field,/[\t]/)
print "length of field:" length(field), field[1], field[2],field[3]
}'
length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf

As you can see:

added a tab with sub(),
so the separator for split() is only [\t],
and the OFS is \n.
And finally only a print.

CodePudding user response：

You can use sub to replace ": " with ":\t" and then gsub to replace each \t with \n. With the default record separator you will not find \n inside a record of awk input unless your own code puts it there; it is therefore a useful delimiter. You can now split on \n and your code will work as you imagine:

echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" \
| awk '{sub(/: /,":\t"); gsub(/\t/,"\n"); split($0,field,/\n/)
print "length of field:" length(field)
for (x=1; x<=length(field); x++) print field[x]}'

Prints:

length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
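
A small variation, in case it helps: split() returns the number of pieces, so the count can be kept in a variable instead of calling length() on the array, which some older awks may not accept. Looping from 1 to that count also avoids for (x in field), whose iteration order awk leaves unspecified. The sketch below assumes any POSIX awk; the names out and n are illustrative.

```shell
out=$(printf 'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf\n' | awk '{
    sub(/: /, ":\t")             # turn the colon-space into colon-tab
    n = split($0, field, "\t")   # split() returns the number of pieces
    print "length of field:" n
    for (i = 1; i <= n; i++)     # index order, not "for (x in field)"
        print field[i]
}')

printf '%s\n' "$out"
```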

CodePudding user response：

IMHO for a job like this you should use GNU awk for the 3rd arg to match() instead of split():

$ echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" |
awk '
    match($0,/([^:]+:)\s*([^\t]+)\t(.*)/,field) {
        print "length of field:" length(field);for (x in field) print x, field[x]
    }
'
length of field:12
0start 1
0length 58
3start 40
1start 1
2start 19
3length 19
2length 20
1length 17
0 ffffffff81000000: 48 8d 25 51 3f 60 01        leaq asdf asdf asdf
1 ffffffff81000000:
2 48 8d 25 51 3f 60 01
3 leaq asdf asdf asdf

Note that the resultant array has a lot more information than just the 3 fields that get populated with the strings that match the regexp segments. Just ignore the extra fields if you don't need them:

$ echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" |
awk '
    match($0,/([^:]+:)\s*([^\t]+)\t(.*)/,field) {
        for (x=1; x<=3; x++) print x, field[x]
    }
'
1 ffffffff81000000:
2 48 8d 25 51 3f 60 01
3 leaq asdf asdf asdf

CodePudding user response：

perl may be a better choice than awk for the task at hand:

#!/bin/bash

perl -F'\t|(?<=:)\x20' -ane 'print "length of field:" . @F . "\n", join("\n", @F)' <<< $'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf'

length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf

It's even easier if you don't need to print the length of field:

perl -pe 's/(?<=:) |\t/\n/g' <<< $'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf'

ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf