How to print the row number and starting location of a pattern when multiple matches per row are present

I want to use awk to match all the occurrences of a pattern within a large file. For each match, I would like to print the row number and the starting position of the pattern along the row (sort of xy coordinates). There are several occurrences of the pattern in each line. I found this somewhat related question.

So far, I managed to do it only for the first (leftmost) occurrence in each line. As an example:

echo xyzABCdefghiABCdefghiABCdef | awk 'match($0, /ABC/) {print NR, RSTART } '

The resulting output is :

1 4

But what I would expect is something like this:

1 4
1 13
1 22

I tried using split instead of match. I managed to identify all the occurrences, but RSTART is lost and printed as "0".

echo xyzABCdefghiABCdefghiABCdef | awk ' { split($0,t, /ABC/,m) ; for (i=1; i in m; i++) print (NR, RSTART) } '

Output:

1 0
1 0
1 0

Any advice would be appreciated. I am not limited to using awk, but an awk solution would be preferred. Also, in my case the pattern to match would be a regex (/A.C/). Thank you

CodePudding user response：

Determination of the coordinates of a string with awk:

echo "xyzABCdefghiABCdefghiABCdef" \
  | awk -v s="ABC" 'BEGIN{ len=length(s) }
      {
        for(i=1; i<=length($0); i++){
          if(substr($0, i, len)==s){
            print NR, i
          }
        }
      }'

Output:

1 4
1 13
1 22

As one line:

echo xyzABCdefghiABCdefghiABCdef | awk -v s="ABC" 'BEGIN{ len=length(s) } { for(i=1; i<=length($0); i++){ if(substr($0,i,len)==s) { print NR,i } } }'

Source: Find position of character with awk
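
The loop above tests every character position of the line. As a rough sketch of the same fixed-string idea, index() can jump straight to each occurrence instead. The function name find_fixed and the variable names off and rest are made up for this example, and the search string is assumed to be non-empty. Note that -v processes backslash escapes in the value.

```shell
# Hypothetical helper: print "row column" for every occurrence of a fixed string.
# $1 = search string (assumed non-empty); input is read from stdin.
find_fixed() {
  awk -v s="$1" '{
    off = 0          # number of characters already cut off the front of the line
    rest = $0        # unscanned remainder of the line
    while ((p = index(rest, s)) > 0) {
      print NR, off + p
      off += p                     # step one character past the start of this match,
      rest = substr(rest, p + 1)   # so overlapping occurrences are reported too
    }
  }'
}

echo xyzABCdefghiABCdefghiABCdef | find_fixed ABC
```

Like the substr() loop, this reports overlapping occurrences, and it only handles literal strings, not a regex such as /A.C/.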

CodePudding user response：

This may be what you're trying to do:

echo xyzABCdefghiABCdefghiABCdef | 
awk '{ begpos=1
       while (match(substr($0, begpos), /ABC/)) {
           print NR, begpos + RSTART - 1
           begpos += RLENGTH + RSTART - 1
       }
     }'
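
Since the question mentions a regex such as /A.C/, the same loop can take the pattern as a variable (a dynamic regex). This is only a sketch: the function name find_matches and the variable name re are made up for this example, and the extra checks are there so that a pattern able to match the empty string cannot loop forever. Because -v processes backslash escapes, a pattern containing backslashes would need them doubled.

```shell
# Hypothetical helper: print "row column" for every non-overlapping regex match.
# $1 = regex passed as a string; input is read from stdin.
find_matches() {
  awk -v re="$1" '{
    begpos = 1
    # stop at end of line; otherwise an empty-matching regex would never terminate
    while (begpos <= length($0) && match(substr($0, begpos), re)) {
      print NR, begpos + RSTART - 1
      # skip past the match, advancing at least one character on an empty match
      begpos += RSTART - 1 + (RLENGTH > 0 ? RLENGTH : 1)
    }
  }'
}

echo xyzABCdefghiAXCdefghiABCdef | find_matches 'A.C'
```

For this input the sketch prints 1 4, 1 13 and 1 22, one pair per line. Matches are non-overlapping because the scan resumes after the end of each match.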

CodePudding user response：

One awk idea using split() and some slicing-n-dicing of length() results:

ptn='ABC'

echo xyzABCdefghiABCdefghiABCdef | 
awk -v ptn="${ptn}" '
{ pos=-(length(ptn)-1)
  n=split($0,arr,ptn)
  for (i=1;i<n;i++) { 
      pos+=length(arr[i] ptn)
      print NR,pos
  }
}'

This generates:

1 4
1 13
1 22

CodePudding user response：

Another option using gnu awk could be using split with a regex.

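In gawk's split function, the 3rd argument is the field separator (here a regex) and the optional 4th argument is the seps array, which receives the text matched by each separator. The lengths of the pieces in the 2nd-argument array and of the entries in seps can be used together to calculate the positions.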

echo xyzABCdefghiABCdefghiABCdef | 
awk ' { 
  n=split($0, a, /ABC/, seps); pos=1
  for(i=1; i<n; i++){
    pos += length(a[i])
    print NR, pos
    pos += length(seps[i])
  } 
}'

Output

1 4
1 13
1 22

CodePudding user response：

With your shown samples, please try the following awk code.

awk '
{
  prev=0
  while(match($0,/ABC/)){
    $0=substr($0,RSTART+RLENGTH)
    print FNR,prev+RSTART
    prev+=RSTART+2
  }
}
'  Input_file

Explanation: a detailed, line-by-line explanation of the code above.

awk '                              ##Starting awk program from here.
{
  prev=0                           ##Setting prev variable to 0 here.
  while(match($0,/ABC/)){          ##Using while loop to match ABC string; it runs as long as an ABC match is found in the current line.
    $0=substr($0,RSTART+RLENGTH)   ##Re-creating current line by assigning value of rest of line(which starts after match of ABC).
    print FNR,prev+RSTART          ##Printing line number along with prev+RSTART value here.
    prev+=RSTART+2                 ##Adding RSTART+2 to prev here (2 is RLENGTH-1 for the 3-character match).
  }
}
'  Input_file                      ##Mentioning Input_file name here.