Replace every 4th occurrence of char "_" with "@" in multiple files

I am trying to replace every 4th occurrence of "_" with "@" in multiple files with bash.

E.g.

foo_foo_foo_foo_foo_foo_foo_foo_foo_foo..

would become

foo_foo_foo_foo@foo_foo_foo_foo@foo_foo...

#perl -pe 's{_}{++$n % 4 ? $& : "@"}ge' *.txt

I have tried perl, but the problem is that the count carries on from the previous file instead of restarting at 0 for each new file. In some files, for example, the very first _ gets replaced.

I have tried:

#awk '{for(i=1; i<=NF; i++) if($i=="_") if(++count%4==0) $i="@"}1' *.txt

but this also does not work.

Using sed, I cannot find a way to keep replacing every 4th occurrence, as there are different numbers of _ in each file. Some files have 20 _, some have 200 _. Therefore, I can't specify a range.

I am really lost as to what to do. Can anybody help?

CodePudding user response：

You just need to reset the counter in the perl one using eof to tell when it's done reading each file:

perl -pe 's{_}{++$n % 4 ? "_" : "@"}ge; $n = 0 if eof' *.txt

CodePudding user response：

This MAY be what you want, using GNU awk for RT:

$ awk -v RS='_' '{ORS=(FNR%4 ? RT : "@")} 1' file
foo_foo_foo_foo@foo_foo_foo_foo@foo_foo..

It only reads each _-separated string into memory 1 at a time so should work no matter how large your input file, assuming there are _s in it.

It assumes you want to replace every 4th _ across the whole file as opposed to within individual lines.

CodePudding user response：

A simple sed would handle this:

s='foo_foo_foo_foo_foo_foo_foo_foo_foo_foo'
sed -E 's/(([^_]+_){3}[^_]+)_/\1@/g' <<< "$s"

foo_foo_foo_foo@foo_foo_foo_foo@foo_foo

Explanation:

(: Start capture group #1
- ([^_]+_){3}: Match 1+ non-_ characters followed by a _. Repeat this group 3 times to match 3 such words, each followed by _
- [^_]+: Match 1+ non-_ characters
): End capture group #1
_: Match a _
Replacement is \1@ to replace 4th _ with a @
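
Two things are worth checking before relying on this: sed applies the substitution line by line, so the count restarts on every line, and [^_]+ requires at least one character between underscores. The sketch below compares it with a [^_]* variant on input containing two adjacent underscores. The function names every_4th_plus and every_4th_star are made up for this illustration.

```shell
# Hypothetical helpers wrapping the two regex variants; both read stdin.
every_4th_plus() { sed -E 's/(([^_]+_){3}[^_]+)_/\1@/g'; }  # fields must be non-empty
every_4th_star() { sed -E 's/(([^_]*_){3}[^_]*)_/\1@/g'; }  # fields may be empty

printf 'a__b_c_d\n' | every_4th_plus
# a__b_c_d     (no match because of the empty field, line is unchanged)

printf 'a__b_c_d\n' | every_4th_star
# a__b_c@d     (the 4th _ is replaced)
```

If your files never contain consecutive underscores, both variants should behave the same.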

CodePudding user response：

With GNU awk

$ cat ip.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
123_45678_90
_

$ awk -v RS='(_[^_]+){3}_' -v ORS= '{sub(/_$/, "@", RT); print $0 RT}' ip.txt
foo_foo_foo_foo@foo_foo_foo_foo@foo_foo
123_45678_90
@

-v RS='(_[^_]+){3}_' sets the input record separator to cover a sequence of four _ (the text matched by this separator is available via RT)
-v ORS= empty output record separator
sub(/_$/, "@", RT) change last _ to @
Use -i inplace for inplace editing.

CodePudding user response：

If the count should reset for each line:

perl -pe's/(?:_[^_]*){3}\K_/\@/g'

$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo

$ perl -pe's/(?:_[^_]*){3}\K_/\@/g' a.txt a.txt
foo_foo_foo_foo@foo_foo_foo_foo@foo_foo
foo_foo_foo_foo@foo_foo_foo_foo@foo_foo
foo_foo_foo_foo@foo_foo_foo_foo@foo_foo
foo_foo_foo_foo@foo_foo_foo_foo@foo_foo

If the count shouldn't reset for each line, but should reset for each file:

perl -0777pe's/(?:_[^_]*){3}\K_/\@/g'

The -0777 causes the whole file to be treated as one line. This makes the count work properly across lines.

But since a new match is used for each file, the count is reset between files.

$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo

$ perl -0777pe's/(?:_[^_]*){3}\K_/\@/g' a.txt a.txt
foo_foo_foo_foo@foo_foo_foo_foo@foo_foo
foo_foo_foo@foo_foo_foo_foo@foo_foo_foo
foo_foo_foo_foo@foo_foo_foo_foo@foo_foo
foo_foo_foo@foo_foo_foo_foo@foo_foo_foo

To avoid reading the entire file at once, you could continue using the counter-based approach you already have, but with the following added:

$n = 0 if eof;

Note that eof is not the same thing as eof()! See perldoc -f eof.

CodePudding user response：

With GNU sed:

sed -nsE ':a;${s/(([^_]*_){3}[^_]*)_/\1@/g;p};N;ba' *.txt

-n suppresses the automatic printing, -s processes each file separately, -E uses extended regular expressions.

The script is a loop between label a (:a) and the branch-to-label-a command (ba). Each iteration appends the next line of input to the pattern space (N). This way, after the last line has been read, the pattern space contains the whole file(*). During the last iteration, when the last line has been read ($), a substitute command (s) replaces every 4th _ in the pattern space by a @ (s/(([^_]*_){3}[^_]*)_/\1@/g) and prints (p) the result.

Once you are satisfied with the result, you can change the options:

sed -i -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1@/g;p};N;ba' *.txt

to modify the files in-place, or:

sed -i.bkp -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1@/g;p};N;ba' *.txt

to modify the files in-place, but keep a *.txt.bkp backup of each file.

(*) Note that if you have very large files this could cause memory overflows.

CodePudding user response：

With your shown samples, please try the following awk program. It uses an awk variable named fieldNum, set to 4 here because the OP wants an @ in place of every 4th _; change it to suit your needs. Note that the count restarts on each line.

awk -v fieldNum="4" '
BEGIN{ FS=OFS="_" }
{
  val=""
  for(i=1;i<=NF;i++){
    # Only add a separator between fields, so a line never ends in a stray "@".
    val=val $i (i<NF?(i%fieldNum==0?"@":OFS):"")
  }
  print val
}
'  Input_file
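
For trying this out on stdin, here is a minimal sketch of the same field-splitting idea wrapped in a shell function. The name at_every_nth and its optional argument are assumptions for this illustration. It counts per line and only places a separator between fields.

```shell
# Hypothetical helper: replace every n-th "_" on each line (default n=4).
# Reads stdin; the count restarts on every line.
at_every_nth() {
  awk -v n="${1:-4}" '
  BEGIN { FS = "_" }
  {
    out = $1
    # The separator in front of field i is the (i-1)-th underscore of the line.
    for (i = 2; i <= NF; i++)
      out = out ((i - 1) % n == 0 ? "@" : "_") $i
    print out
  }'
}

printf 'a_b_c_d\na_b_c_d_e\n' | at_every_nth 4
# a_b_c_d
# a_b_c_d@e
```

The first line has only three underscores, so it is printed unchanged. If the count needs to run across lines, one of the whole-file approaches above is a better fit.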