Slow speed with gawk for multiple edits to the same file

I run a test environment in which I created 40,000 test files with a lorem ipsum generator. The files are between 200k and 5 MB in size. I want to modify many randomly chosen files: in each one I change 5% of the lines by deleting 2 lines and inserting 1 line containing a base64 string.

The problem is that this procedure takes too much time per file. I tried to fix it by copying the test file to RAM and changing it there, but I see a single thread that uses only one full core, and gawk accounts for most of the CPU work. I have looked for solutions but have not found the right advice. I think gawk could do this in one step, but for big files the argument string becomes too long when I check it against "getconf ARG_MAX".

How can I speed this up?

zeilen=$(wc -l < testfile$filecount.txt);
    
    durchlauf=$(($zeilen/20))
    zeilen=$((zeilen-2))
    for (( c=1; c<=durchlauf; c++ ))
    do
        zeile=$(shuf -i 1-$zeilen -n 1);
        
        zeile2=$((zeile+1))
        zeile3=$((zeile2+1))
        
        string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
        
        if [[ $c -eq 1 ]] 
        then
        gawk -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next;print} \
        NR==n2{next; print} NR==n3{print s}1' testfile$filecount.txt > /mnt/RAM/tempfile.tmp
        else
        gawk -i inplace -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next; print} \
        NR==n2{next; print} NR==n3{print s}1' /mnt/RAM/tempfile.tmp
        fi
       
    done

CodePudding user response：

I don't know what the rest of your script is doing, but the example below should give you an idea of how to vastly improve its performance.

Instead of this which calls base64, tr, head, and awk on each iteration of the loop with all of the overhead that implies:

for (( c=1; c<=3; c++ ))
do
    string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
    echo "$string" | awk '{print "<" $0 ">"}'
done
<nSxzxmRQc11+fFnG7ET4EBIBUwoflPo9Mop0j50C1MtRoLNjb43aNTMNRSMePTnGub5gqDWeV4yEyCVYC2s519JL5OLpBFxSS/xOjbL4pkmoFqOceX3DTmsZrl/RG+YLXxiLBjL//I220MQAzpQE5bpfQiQB6BvRw64HbhtVzHYMODbQU1UYLeM6IMXdzPgsQyghv1MCFvs0Nl4Mez2Zh98f9+472c6K+44nmi>
<9xfgBc1Y7P/QJkB6PCIfNg0b7V+KmSUS49uU7XdT+yiBqjTLcNaETpMhpMSt3MLs9GFDCQs9TWKx7yXgbNch1p849IQrjhtZCa0H5rtCXJbbngc3oF9LYY8WT72RPiV/gk4wJrAKYq8/lKYzu0Hms0lHaOmd4qcz1hpzubP7NuiBjvv16A8T3slVG1p4vwxa5JyfgYIYo4rno219ba/vRMB1QF9HaAppdRMP32>
<K5kNgv9EN1a/c/7eatrivNeUzKYolCrz5tHE2yZ6XNm1aT4ZZq3OaY5UgnwF8ePIpMKVw5LZNstVwFdVaNvtL6JreCkcO+QtebsCYg5sAwIdozwXFs4F4hZ/ygoz3DEeMWYgFTcgFnfoCV2Rct2bg/mAcJBZ9+4x9IS+JNTA64T1Zl+FJiCuHS05sFIsZYBCqRADp2iL3xcTr913dNplqUvBEEsW1qCk/TDwQh>

you should write this which only calls each tool once and so will run orders of magnitude faster:

$ base64 /dev/urandom | tr -dc '[[:print:]]' |
    gawk -v RS='.{230}' '{print "<" RT ">"} NR==3{exit}'
<X0If1qkQItVLDOmh2BFYyswBgKFZvEwyA+WglyU0BhqWHLzURt/AIRgL3olCWZebktfwBU6sK7N3nwK6QV2g5VheXIY7qPzkzKUYJXWvgGcrIoyd9tLUjkM3eusuTTp4TwNY6E/z7lT0/2oQrLH/yZr2hgAm8IXDVgWNkICw81BRPUqITNt3VqmYt/HKnL4d/i88F4QDE0XgivHzWAk6OLowtmWAiT8k1a0Me6>
<TqCyRXj31xsFcZS87vbA50rYKq4cvIIn1oCtN6PJcIsSUSjG8hIhfP8zwhzi6iC33HfL96JfLIBcLrojOIkd7WGGXcHsn0F0XVauOR+t8SRqv+/t9ggDuVsn6MsY2R4J+mppTMB3fcC5787u0dO5vO1UTFWZG0ZCzxvX/3oxbExXb8M54WL6PZQsNrVnKtkvllAT/s4mKsQ/ojXNB0CTw7L6AvB9HU7W2x+U3j>
<ESsGZlHjX/nslhJD5kJGsFvdMp+PC5KA+xOYlcTbc/t9aXoHhAJuy/KdjoGq6VkP+v4eQ5lNURdyxs+jMHqLVVtGwFYSlc61MgCt0IefpgpU2e2werIQAsrDKKT1DWTfbH1qaesTy2IhTKcEFlW/mc+1en8912Dig7Nn2MD8VQrGn6BzvgjzeGRqGLAtWJWkzQjfx+74ffJQUXW4uuEXA8lBvbuJ8+yQA2WHK5>

CodePudding user response：

@mark-fuso, wow, that's incredibly fast! But there is a mistake in the script: the file grows a little in size, which is something I have to avoid. I think that if two random line numbers ($durchlauf) directly follow each other, one line is not deleted. Honestly, I don't completely understand what your command is doing, but it works very well. I think I need more bash experience for a task like this.

Sample output:

64
65
66
gOf0Vvb9OyXY1Tjb1r4jkDWC4VIBpQAYnSY7KkT1gl5MfnkCMzUmN798pkgEVAlRgV9GXpknme46yZURCaAjeg6G5f1Fc7nc7AquIGnEER>
AFwB9cnHWu6SRnsupYCPViTC9XK+fwGkiHvEXrtw2aosTGAAFyu0GI8Ri2+NoJAvMw4mv/FE72t/xapmG5wjKpQYsBXYyZ9YVV0SE6c6rL>
70
71

CodePudding user response：

Assumptions:

generate $durchlauf (a number) random line numbers; we'll refer to a single number as n ...
delete lines numbered n and n+1 from the input file and in their place ...
insert $string (a randomly generated base64 string)
this list of random line numbers must not have any consecutive line numbers

As others have pointed out you want to limit yourself to a single gawk call per input file.

New approach:

generate $durchlauf (count) random numbers (see gen_numbers() function)
generate $durchlauf (count) base64 strings (we'll reuse Ed Morton's code)
paste these 2 sets of data into a single input stream/file
feed 2 files to gawk ... the paste result and the actual file to be modified
we won't be able to use gawk's -i inplace so we'll use an intermediate tmp file
when we find a matching line in our input file we'll 1) insert the base64 string and then 2) skip/delete the current/next input lines; this should address the issue where we have two random numbers that are different by 1

One idea to ensure we do not generate consecutive line numbers:

break our set of line numbers into ranges, eg, 100 lines split into 5 ranges => 1-20 / 21-40 / 41-60 / 61-80 / 81-100
reduce the end of each range by 1, eg, 1-19 / 21-39 / 41-59 / 61-79 / 81-99
use $RANDOM to generate one number within each range (this tends to be at least a magnitude faster than comparable shuf calls)

We'll use a function to generate our list of non-consecutive line numbers:

gen_numbers () {

max=$1                             # $zeilen     eg, 100
count=$2                           # $durchlauf  eg, 5

interval=$(( max / count ))        # eg, 100 / 5 = 20

for (( start=1; start<max; start=start+interval ))
do
        end=$(( start + interval - 2 ))

        out=$(( ( RANDOM % interval ) + start ))
        [[ $out -gt $end ]] && out=${end}

        echo ${out}
done
}

Sample run:

$ zeilen=100
$ durchlauf=5
$ gen_numbers ${zeilen} ${durchlauf}
17
31
54
64
86
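
One way to sanity-check the non-consecutive property is to pipe the generated numbers through a small awk filter. The sketch below repeats `gen_numbers` so that it runs on its own (with `local` added, which is not in the version above); `check_gaps` is a hypothetical helper introduced only for this check.

```shell
#!/usr/bin/env bash
# gen_numbers repeated from above so this snippet is self-contained.
gen_numbers () {
    local max=$1 count=$2
    local interval=$(( max / count ))
    local start end out
    for (( start=1; start<max; start=start+interval ))
    do
        end=$(( start + interval - 2 ))
        out=$(( ( RANDOM % interval ) + start ))
        [[ $out -gt $end ]] && out=${end}
        echo "${out}"
    done
}

# check_gaps (hypothetical helper): exit non-zero if two successive
# numbers differ by less than 2, i.e. if deleting lines n and n+1
# for one number could overlap with the next number.
check_gaps () {
    awk 'NR>1 && $1-prev<2 {bad=1} {prev=$1} END{exit bad+0}'
}

if gen_numbers 100 5 | check_gaps; then
    echo "ok: no overlapping line pairs"
else
    echo "overlap found"
fi
```

Because each number is capped at the end of its range minus one, the gap to the next range's first possible value is always at least 2.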

Demonstration of the paste/gen_numbers/base64/tr/gawk idea:

$ zeilen=300
$ durchlauf=3
$ paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )

This generates:

74      7VFhnDN4J...snip...rwnofLv
142     ZYv07oKMB...snip...xhVynvw
261     gifbwFCXY...snip...hWYio3e

Main code:

tmpfile=$(mktemp)

while/for loop ... # whatever OP is using to loop over list of input files
do
    zeilen=$(wc -l < "testfile${filecount}".txt)
    durchlauf=$(( $zeilen/20 ))

    awk '

    # process 1st file (ie, paste/gen_numbers/base64/tr/gawk)

    FNR==NR        { ins[$1]=$2                 # store base64 in ins[] array
                     del[$1]=del[($1)+1]        # make note of zeilen and zeilen+1 line numbers for deletion
                     next
                   }

    # process 2nd file

    FNR in ins     { print ins[FNR] }           # insert base64 string?

    ! (FNR in del)                              # if current line number not in del[] array then print the line

    ' <( paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )) "testfile${filecount}".txt > "${tmpfile}"

    # the last line with line continuations for readability:
    #' <( paste \
    #         <( gen_numbers ${zeilen} ${durchlauf} ) \
    #         <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' ) \
    #   ) \
    #"testfile${filecount}".txt > "${tmpfile}"

    mv "${tmpfile}" "testfile${filecount}".txt

done

Simple example of awk code in action:

$ cat orig.txt
line1
line2
line3
line4
line5
line6
line7
line8
line9

$ cat paste.out           # simulated output from paste/gen_numbers/base64/tr/gawk
1 newline1
5 newline5

$ awk '...' paste.out orig.txt
newline1
line3
line4
newline5
line7
line8
line9
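
For anyone who wants to reproduce this example, here is a self-contained sketch with the awk program from the main code written out in place of the `'...'`. The temporary directory and the `result.txt` file name are assumptions made for the demo.

```shell
#!/usr/bin/env bash
# Throwaway working directory (an assumption for this demo).
workdir=$(mktemp -d)

# Recreate the two sample input files shown above.
printf 'line%s\n' 1 2 3 4 5 6 7 8 9 > "$workdir/orig.txt"
printf '%s\n' '1 newline1' '5 newline5' > "$workdir/paste.out"

awk '
# 1st file: remember the replacement text and mark lines n and n+1
FNR==NR    { ins[$1]=$2; del[$1]=del[($1)+1]; next }

# 2nd file: print the replacement when line n is reached
FNR in ins { print ins[FNR] }

# 2nd file: print every line that is not marked for deletion
!(FNR in del)
' "$workdir/paste.out" "$workdir/orig.txt" > "$workdir/result.txt"

cat "$workdir/result.txt"
```

Each replacement removes two lines and adds one, so the 9-line input yields 7 lines with two replacements.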