Context: in a POSIX-only project, I need to reformat a logfile, changing iso dates to UNIX timestamps. The log file looks like:
2022-01-11T18:22:46+0100 call_ring 3366
2022-01-11T19:36:54+0100 call_ring 33611
2022-01-12T07:49:15+0100 call_ring 33616
2022-01-12T08:57:20+0100 call_ring 33621
2022-01-12T09:42:56+0100 call_ring 33648
2022-01-12T12:20:48+0100 call_ring 3364
2022-01-12T12:28:01+0100 call_ring 3364
2022-01-12T13:16:31+0100 call_ring 33628
For now I use
awk -F'\t' '{cmd="date \"+%s\t"$3"\" -d "$1;system(cmd)}' logs.tsv
But on large files it's unbearably slow (more than 50s for ~20k lines). I believe the cause is the system() function forking 20k processes.
Is there a faster way to do this in a POSIX shell script? I'd like to avoid Python or Perl for this, too.
(Note that date -d is not POSIX; we plan to write an in-house binary for this, so date -d is acceptable in an answer.)
CodePudding user response:
<fart age=old mode=grumpy> kids today! </>
$ cat <<'EOF' >calctime.awk
# this simplified formula only works 2000-03 to 2100-02 which I assume
# should cover any logfile timestamps of interest today or the near future;
# it can fairly easily be extended as far back or forward as the Gregorian
# calendar was/remains in use
{ split($1,a,/[-+T:]/);
t = a[2]<=2;
t = int((a[1]-2000-t)*365.25) + int((a[2]-3+t*12)*30.6+.5) + a[3]-1;
t = t*86400 + a[4]*3600 + a[5]*60 + a[6] + 951868800;
t -= (substr($1,20,1)=="-"?-1:+1)*(substr(a[7],1,2)*3600+substr(a[7],3,2)*60);
print t,$3; }
EOF
$ awk -f calctime.awk -v OFS='\t' infile >outfile # or pipe to something
Time less than 0.1 second.
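For anyone who wants to sanity-check the day-count formula before trusting it on a real log, here is a minimal sketch that wraps the same arithmetic in a shell function and feeds it timestamps whose epoch values are easy to verify by hand. The function name iso_to_epoch and the variable names m, d and off are mine, not part of the script above, and the same 2000-03 to 2100-02 validity range is assumed.

```shell
#!/bin/sh
# iso_to_epoch: hypothetical helper name, used only for this illustration.
# Reads ISO timestamps such as 2022-01-11T18:22:46+0100 (one per line) and
# prints the matching UNIX timestamps. Assumed valid for 2000-03 to 2100-02.
iso_to_epoch() {
  awk '{
    split($1, a, /[-+T:]/)      # a[1..7]: year month day hour min sec offset
    m = (a[2] + 0 <= 2)         # 1 for Jan/Feb, which count as months 13/14 of the previous year
    # days elapsed since 2000-03-01
    d = int((a[1] - 2000 - m) * 365.25) + int((a[2] - 3 + m * 12) * 30.6 + .5) + a[3] - 1
    # 951868800 is the epoch value of 2000-03-01T00:00:00Z
    t = d * 86400 + a[4] * 3600 + a[5] * 60 + a[6] + 951868800
    off = substr(a[7], 1, 2) * 3600 + substr(a[7], 3, 2) * 60
    if (substr($1, 20, 1) == "-") off = -off
    print t - off               # local time minus UTC offset gives UTC seconds
  }'
}

printf '%s\n' '2000-03-01T00:00:00+0000' '2022-01-11T18:22:46+0100' | iso_to_epoch
# prints:
# 951868800
# 1641921766
```

The second value is what GNU date -d "2022-01-11T18:22:46+0100" +%s reports as well, since 18:22:46 at +0100 is 17:22:46 UTC.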
CodePudding user response:
This may not directly answer your question, but I have benchmarked several solutions against a generated 20,000-line file in the format of the posted log.
OP's awk solution
#!/bin/sh
awk -F'\t' '{cmd="date \"+%s\t"$3"\" -d "$1;system(cmd)}' logs.tsv > /dev/null
took 59.5 seconds to complete.
POSIX sh
#!/bin/sh
IFS=$(printf "\t")
while read -r a b c; do
echo "$(date +%s -d "$a")$IFS$c"
done < logs.tsv > /dev/null
took 35.6 seconds.
date command only
#!/bin/sh
i=0
while [ "$i" -lt 20000 ]; do
date +%s -d "2022-01-11T18:22:46+0100"
i=$((i + 1))
done > /dev/null
took 33.4 seconds.
It seems that most of the execution time is consumed by the date command.
If you plan to write your own replacement for date -d, it should be significantly faster.
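In case someone wants to repeat the timings, here is a rough sketch of how a test file of that size could be produced. The helper name gen_logs is hypothetical, and repeating one fixed timestamp on every line is my assumption about the test data. Only the shape of the lines (date, tab, call_ring, tab, number) is taken from the question.

```shell
#!/bin/sh
# gen_logs N: hypothetical helper that prints N tab-separated lines shaped
# like the log in the question. The fixed timestamp is an assumption.
gen_logs() {
  awk -v n="$1" 'BEGIN {
    for (i = 1; i <= n; i++)
      printf "2022-01-11T18:22:46+0100\tcall_ring\t%d\n", 33600 + i
  }'
}

gen_logs 3   # small demonstration; use a larger N for benchmarking
```

Running gen_logs 20000 > logs.tsv and then timing each script with time sh script.sh should give figures comparable to the ones above, although the absolute numbers will depend on the machine.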
CodePudding user response:
Just in case you could relax your POSIX-only requirement in exchange for a 150x speed-up... With the GNU awk built-in mktime and gensub functions (tested with a 20k-line input):
$ time awk '{utc = gensub(/.*([+-][0-9]+)/,"\\1",1,$1);
gsub(/-|T|:|\+[0-9]+/," ",$1);
$1 = mktime($1, utc); print}' logs.tsv > tmp.tsv
real 0m0.290s
user 0m0.280s
sys 0m0.005s
$ head tmp.tsv
1641925366 call_ring 3366
1641929814 call_ring 33611
...
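gensub extracts the UTC flag (e.g. +0100) from the first field, and gsub reformats the first field as the space-separated string required by mktime (e.g. 2022 01 11 18 22 46). Note that mktime only tests whether its second argument is non-zero or non-empty, so the time is interpreted as UTC and the offset itself is not subtracted.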
This version attempts to determine whether daylight saving time is in effect for the specified time. Replace mktime($1, utc) with:
mktime($1 " 1", utc) if you want awk to assume daylight saving time,
mktime($1 " 0", utc) if you want awk to assume standard time.
