I run a test environment where I created 40,000 test files with a lorem ipsum algorithm. The files are between 200 KB and 5 MB in size. I want to modify lots of random files. I will change 5% of the lines by deleting 2 lines and inserting 1 line with a base64 string.
The problem is that this procedure needs too much time per file. I tried to fix it by copying the test file to RAM and changing it there, but I see a single thread that uses only one full core, and gawk shows the most CPU work. I've been looking for solutions, but I can't find the right advice. I think gawk could do this in one step, but for big files I get a string that is too long when I calculate with "getconf ARG_MAX".
How can I speed this up?
zeilen=$(wc -l < testfile$filecount.txt);
durchlauf=$(($zeilen/20))
zeilen=$((zeilen-2))
for (( c=1; c<=durchlauf; c++ ))
do
zeile=$(shuf -i 1-$zeilen -n 1);
zeile2=$((zeile+1))
zeile3=$((zeile2+1))
string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
if [[ $c -eq 1 ]]
then
gawk -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next;print} \
NR==n2{next; print} NR==n3{print s}1' testfile$filecount.txt > /mnt/RAM/tempfile.tmp
else
gawk -i inplace -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next; print} \
NR==n2{next; print} NR==n3{print s}1' /mnt/RAM/tempfile.tmp
fi
done
CodePudding user response:
I don't know what the rest of your script is doing, but the following should give you an idea of how to vastly improve its performance.
Instead of this, which calls base64, tr, head, and awk on each iteration of the loop with all of the overhead that implies:
for (( c=1; c<=3; c++ ))
do
string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
echo "$string" | awk '{print "<" $0 ">"}'
done
<nSxzxmRQc11+fFnG7ET4EBIBUwoflPo9Mop0j50C1MtRoLNjb43aNTMNRSMePTnGub5gqDWeV4yEyCVYC2s519JL5OLpBFxSS/xOjbL4pkmoFqOceX3DTmsZrl/RG+YLXxiLBjL//I220MQAzpQE5bpfQiQB6BvRw64HbhtVzHYMODbQU1UYLeM6IMXdzPgsQyghv1MCFvs0Nl4Mez2Zh98f9+472c6K+44nmi>
<9xfgBc1Y7P/QJkB6PCIfNg0b7V+KmSUS49uU7XdT+yiBqjTLcNaETpMhpMSt3MLs9GFDCQs9TWKx7yXgbNch1p849IQrjhtZCa0H5rtCXJbbngc3oF9LYY8WT72RPiV/gk4wJrAKYq8/lKYzu0Hms0lHaOmd4qcz1hpzubP7NuiBjvv16A8T3slVG1p4vwxa5JyfgYIYo4rno219ba/vRMB1QF9HaAppdRMP32>
<K5kNgv9EN1a/c/7eatrivNeUzKYolCrz5tHE2yZ6XNm1aT4ZZq3OaY5UgnwF8ePIpMKVw5LZNstVwFdVaNvtL6JreCkcO+QtebsCYg5sAwIdozwXFs4F4hZ/ygoz3DEeMWYgFTcgFnfoCV2Rct2bg/mAcJBZ9+4x9IS+JNTA64T1Zl+FJiCuHS05sFIsZYBCqRADp2iL3xcTr913dNplqUvBEEsW1qCk/TDwQh>
You should write this, which calls each tool only once and so will run orders of magnitude faster:
$ base64 /dev/urandom | tr -dc '[[:print:]]' |
gawk -v RS='.{230}' '{print "<" RT ">"} NR==3{exit}'
<X0If1qkQItVLDOmh2BFYyswBgKFZvEwyA+WglyU0BhqWHLzURt/AIRgL3olCWZebktfwBU6sK7N3nwK6QV2g5VheXIY7qPzkzKUYJXWvgGcrIoyd9tLUjkM3eusuTTp4TwNY6E/z7lT0/2oQrLH/yZr2hgAm8IXDVgWNkICw81BRPUqITNt3VqmYt/HKnL4d/i88F4QDE0XgivHzWAk6OLowtmWAiT8k1a0Me6>
<TqCyRXj31xsFcZS87vbA50rYKq4cvIIn1oCtN6PJcIsSUSjG8hIhfP8zwhzi6iC33HfL96JfLIBcLrojOIkd7WGGXcHsn0F0XVauOR+t8SRqv+/t9ggDuVsn6MsY2R4J+mppTMB3fcC5787u0dO5vO1UTFWZG0ZCzxvX/3oxbExXb8M54WL6PZQsNrVnKtkvllAT/s4mKsQ/ojXNB0CTw7L6AvB9HU7W2x+U3j>
<ESsGZlHjX/nslhJD5kJGsFvdMp+PC5KA+xOYlcTbc/t9aXoHhAJuy/KdjoGq6VkP+v4eQ5lNURdyxs+jMHqLVVtGwFYSlc61MgCt0IefpgpU2e2werIQAsrDKKT1DWTfbH1qaesTy2IhTKcEFlW/mc+1en8912Dig7Nn2MD8VQrGn6BzvgjzeGRqGLAtWJWkzQjfx+74ffJQUXW4uuEXA8lBvbuJ8+yQA2WHK5>
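RT is a gawk extension, so the one-liner above needs gawk specifically. Where only a generic awk is available, a rough equivalent can be sketched with head and fold. This swaps fold in for the RS/RT trick and is not the code from the answer above; count and width are illustrative variable names, and base64 reading a file argument assumes GNU coreutils as in the question.

```shell
count=3      # number of strings wanted (illustrative)
width=230    # characters per string, as in the question

# One pipeline produces every string: strip base64's own line breaks,
# keep exactly count*width characters, then cut into fixed-width lines.
strings=$(base64 /dev/urandom | tr -d '\n' | head -c $(( count * width )) | fold -w "$width")

# Same presentation as above: wrap each string in angle brackets.
printf '%s\n' "$strings" | awk '{ print "<" $0 ">" }'
```

Each tool still starts only once, regardless of how many strings are requested.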
CodePudding user response:
@mark-fuso, wow, that's incredibly fast! But there is a mistake in the script. The file grows in size a little bit, which is something I have to avoid. I think that if two random line numbers ($durchlauf) follow each other, then one line is not deleted. Honestly, I don't completely understand what your command is doing, but it works very well. I think that for such a task I need more bash experience.
Sample output:
64
65
66
gOf0Vvb9OyXY1Tjb1r4jkDWC4VIBpQAYnSY7KkT1gl5MfnkCMzUmN798pkgEVAlRgV9GXpknme46yZURCaAjeg6G5f1Fc7nc7AquIGnEER>
AFwB9cnHWu6SRnsupYCPViTC9XK+fwGkiHvEXrtw2aosTGAAFyu0GI8Ri2+NoJAvMw4mv/FE72t/xapmG5wjKpQYsBXYyZ9YVV0SE6c6rL>
70
71
CodePudding user response:
Assumptions:
- generate $durchlauf (a number) random line numbers; we'll refer to a single number as n
- delete lines numbered n and n+1 from the input file and in their place ...
- insert $string (a randomly generated base64 string)
- this list of random line numbers must not have any consecutive line numbers
As others have pointed out you want to limit yourself to a single gawk call per input file.
New approach:
- generate $durchlauf (count) random numbers (see gen_numbers() function)
- generate $durchlauf (count) base64 strings (we'll reuse Ed Morton's code)
- paste these 2 sets of data into a single input stream/file
- feed 2 files to gawk ... the paste result and the actual file to be modified
- we won't be able to use gawk's -i inplace so we'll use an intermediate tmp file
- when we find a matching line in our input file we'll 1) insert the base64 string and then 2) skip/delete the current/next input lines; this should address the issue where we have two random numbers that differ by 1
One idea to ensure we do not generate consecutive line numbers:
- break our set of line numbers into ranges, eg, 100 lines split into 5 ranges => 1-20/21-40/41-60/61-80/81-100
- reduce the end of each range by 1, eg, 1-19/21-39/41-59/61-79/81-99
- use $RANDOM to generate a number within each range (this tends to be at least a magnitude faster than comparable shuf calls)
We'll use a function to generate our list of non-consecutive line numbers:
gen_numbers () {
    max=$1                          # $zeilen eg, 100
    count=$2                        # $durchlauf eg, 5
    interval=$(( max / count ))     # eg, 100 / 5 = 20
    for (( start=1; start<max; start=start+interval ))
    do
        end=$(( start + interval - 2 ))
        out=$(( ( RANDOM % interval ) + start ))
        [[ $out -gt $end ]] && out=${end}
        echo ${out}
    done
}
Sample run:
$ zeilen=100
$ durchlauf=5
$ gen_numbers ${zeilen} ${durchlauf}
17
31
54
64
86
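One way to convince yourself that the generated numbers are never consecutive is to look at the gaps between them. The sketch below repeats gen_numbers (with local variables added) and pipes its output through check_gaps, a hypothetical helper that is not part of the approach itself; it assumes bash, since it relies on $RANDOM and the (( )) loop.

```shell
gen_numbers () {
    local max=$1 count=$2 start end out
    local interval=$(( max / count ))
    for (( start=1; start<max; start=start+interval ))
    do
        end=$(( start + interval - 2 ))            # leave room for line n+1
        out=$(( ( RANDOM % interval ) + start ))
        [[ $out -gt $end ]] && out=${end}
        echo "${out}"
    done
}

# check_gaps (hypothetical helper): reads one number per line and reports
# whether any two neighbouring numbers are less than 2 apart.
check_gaps () {
    awk 'NR>1 && $1-prev < 2 { bad=1 }
         { prev=$1 }
         END { print (bad ? "consecutive" : "ok") }'
}

gen_numbers 100 5 | check_gaps    # prints: ok
```

Because each number is capped at two below the start of the next range, line n+1 of one pick can never collide with line n of the following pick.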
Demonstration of the paste/gen_numbers/base64/tr/gawk idea:
$ zeilen=300
$ durchlauf=3
$ paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )
This generates:
74 7VFhnDN4J...snip...rwnofLv
142 ZYv07oKMB...snip...xhVynvw
261 gifbwFCXY...snip...hWYio3e
Main code:
tmpfile=$(mktemp)
while/for loop ... # whatever OP is using to loop over list of input files
do
zeilen=$(wc -l < "testfile${filecount}".txt)
durchlauf=$(( $zeilen/20 ))
awk '
# process 1st file (ie, paste/gen_numbers/base64/tr/gawk)
FNR==NR { ins[$1]=$2 # store base64 in ins[] array
del[$1]=del[($1)+1] # make note of zeilen and zeilen+1 line numbers for deletion
next
}
# process 2nd file
FNR in ins { print ins[FNR] } # insert base64 string?
! (FNR in del) # if current line number not in del[] array then print the line
' <( paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )) "testfile${filecount}".txt > "${tmpfile}"
# the last line with line continuations for readability:
#' <( paste \
# <( gen_numbers ${zeilen} ${durchlauf} ) \
# <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' ) \
# ) \
#"testfile${filecount}".txt > "${tmpfile}"
mv "${tmpfile}" "testfile${filecount}".txt
done
Simple example of awk code in action:
$ cat orig.txt
line1
line2
line3
line4
line5
line6
line7
line8
line9
$ cat paste.out # simulated output from paste/gen_numbers/base64/tr/gawk
1 newline1
5 newline5
$ awk '...' paste.out orig.txt
newline1
line3
line4
newline5
line7
line8
line9
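To try the example without the random inputs, the two sample files can be recreated in a scratch directory. This is a minimal sketch: workdir and the apply_edits wrapper are illustrative names that do not appear in the code above, and the awk body is the same logic as the main code with the comments removed.

```shell
# Recreate the sample inputs in a scratch directory.
workdir=$(mktemp -d)
printf 'line%s\n' 1 2 3 4 5 6 7 8 9 > "$workdir/orig.txt"
printf '%s\n' '1 newline1' '5 newline5' > "$workdir/paste.out"

# apply_edits (hypothetical wrapper): $1 = "line-number string" list,
# $2 = file to modify; the result goes to stdout.
apply_edits () {
    awk '
    FNR==NR    { ins[$1]=$2              # string to insert at line $1
                 del[$1]=del[($1)+1]     # creates del[$1] and del[$1+1]
                 next
               }
    FNR in ins { print ins[FNR] }        # insert the replacement string
    !(FNR in del)                        # print lines not marked for deletion
    ' "$1" "$2"
}

apply_edits "$workdir/paste.out" "$workdir/orig.txt"
```

With these inputs the function prints the same seven lines as the sample output above: two original lines are dropped and one new line is added for each entry in paste.out.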
