Efficiently format dates from a log file with POSIX tools

Context: in a POSIX-only project, I need to reformat a log file, changing ISO dates to UNIX timestamps. The log file looks like:

2022-01-11T18:22:46+0100    call_ring    3366
2022-01-11T19:36:54+0100    call_ring    33611
2022-01-12T07:49:15+0100    call_ring    33616
2022-01-12T08:57:20+0100    call_ring    33621
2022-01-12T09:42:56+0100    call_ring    33648
2022-01-12T12:20:48+0100    call_ring    3364
2022-01-12T12:28:01+0100    call_ring    3364
2022-01-12T13:16:31+0100    call_ring    33628

For now I use

awk -F'\t' '{cmd="date \"+%s\t"$3"\" -d "$1;system(cmd)}' logs.tsv

But on large files it's unbearably slow (more than 50s for ~20k lines). I believe the cause is the system() function forking 20k processes.

Is there a way to be faster in a POSIX shell script? I'd like to avoid Python or perl for this, too.

(Note that date -d is not POSIX; we plan to write an in-house binary for this, so date -d is acceptable in an answer.)

CodePudding user response：

<fart age=old mode=grumpy> kids today! </>

$ cat <<'EOF' >calctime.awk
# this simplified formula only works 2000-03 to 2100-02 which I assume 
# should cover any logfile timestamps of interest today or the near future;
# it can fairly easily be extended as far back or forward as the Gregorian 
# calendar was/remains in use
{ split($1,a,/[-+T:]/);
  t = a[2]<=2;
  t = int((a[1]-2000-t)*365.25) + int((a[2]-3+t*12)*30.6+.5) + a[3]-1;
  t = t*86400 + a[4]*3600 + a[5]*60 + a[6] + 951868800;
  # local time is UTC plus the offset, so subtract the signed offset
  t -= (substr($1,20,1)=="-"?-1:1)*(substr(a[7],1,2)*3600+substr(a[7],3,2)*60);
  print t,$3; }
EOF
$ awk -f calctime.awk -v OFS='\t' infile >outfile # or pipe to something

Time less than 0.1 second.
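
For timestamps outside the 2000-03 to 2100-02 window, the same idea can be extended with the usual 400-year Gregorian cycle arithmetic. The following is only a sketch under stated assumptions, not the answer's own code: iso_to_epoch is a hypothetical helper name, the input is assumed to always look like 2022-01-11T18:22:46+0100 (sign at column 20, four-digit offset), and years are assumed to be positive.

```shell
#!/bin/sh
# iso_to_epoch: hypothetical helper, not part of the original answer.
# Reads ISO timestamps such as 2022-01-11T18:22:46+0100 on stdin (one per line)
# and prints the matching UNIX timestamp for each line.
iso_to_epoch() {
    awk '{
        split($1, a, /[-+T:]/)              # a[1..6] = Y M D h m s, a[7] = hhmm offset
        y   = a[1] - (a[2] + 0 <= 2)        # Jan/Feb count as the end of the previous year
        era = int(y / 400)                  # 400-year Gregorian cycle (assumes y >= 0)
        yoe = y - era * 400                 # year within the cycle
        mp  = (a[2] + 9) % 12               # March = 0 ... February = 11
        doy = int((153 * mp + 2) / 5) + a[3] - 1
        doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
        t   = (era * 146097 + doe - 719468) * 86400 + a[4] * 3600 + a[5] * 60 + a[6]
        sign = (substr($1, 20, 1) == "-") ? -1 : 1
        # local time is UTC plus the offset, so the signed offset is subtracted
        t -= sign * (substr(a[7], 1, 2) * 3600 + substr(a[7], 3, 2) * 60)
        printf "%.0f\n", t
    }'
}
```

For example, printf '%s\n' '2022-01-11T18:22:46+0100' | iso_to_epoch prints 1641921766, which is 17:22:46 UTC on that day.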

CodePudding user response：

This may not directly answer your question, but I benchmarked several solutions against a generated 20,000-line file in the format of the posted log.

OP's awk solution

#!/bin/sh

awk -F'\t' '{cmd="date \"+%s\t"$3"\" -d "$1;system(cmd)}' logs.tsv > /dev/null

took 59.5 seconds to complete.

POSIX sh

#!/bin/sh

IFS=$(printf "\t")

while read -r a b c; do
    echo $(date +%s -d "$a")"$IFS$c"
done < logs.tsv > /dev/null

took 35.6 seconds.

`date` command only

#!/bin/sh

i=0
while [ "$i" -lt 20000 ]; do
    date +%s -d "2022-01-11T18:22:46+0100"
    i=$(($i + 1))
done > /dev/null

took 33.4 seconds.

It seems that most of the execution time is consumed by the date command. If you plan to write your own replacement for date -d, it will be significantly faster.
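
If starting date once per line is the cost, one option is to start it only once for the whole file. A minimal sketch, assuming GNU date, whose -f option parses one date per input line (this is no more POSIX than -d). convert_log is a hypothetical helper name, and the input is assumed to be a regular file with the three tab-separated columns shown in the question:

```shell
#!/bin/sh
# convert_log FILE: hypothetical helper; assumes GNU date (-f is not POSIX).
# Prints "<epoch><TAB><third column>" for each line of FILE.
convert_log() {
    cut -f1 "$1" |      # keep only the ISO timestamp column
    date -f - +%s |     # a single date process converts every line
    paste - "$1" |      # glue each epoch back onto its original line
    cut -f1,4           # epoch plus the original third column
}
```

Usage would be convert_log logs.tsv > out.tsv. The file is read twice, so this does not work on a pipe.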

CodePudding user response：

Just in case you could relax your POSIX-only requirement for a 150+ speed-up factor... With the GNU awk built-in mktime and gensub functions (tested with a 20k-line input):

$ time awk '{utc = gensub(/.*([+-][0-9]+)/,"\\1",1,$1);
             gsub(/-|T|:|\+[0-9]+/," ",$1);
             $1 = mktime($1, utc); print}' logs.tsv > tmp.tsv

real    0m0.290s
user    0m0.280s
sys     0m0.005s

$ head tmp.tsv
1641925366 call_ring  3366
1641929814 call_ring  33611
...

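gensub extracts the UTC flag (e.g. +0100) from the first field; gsub reformats the first field as the space-separated string required by mktime (e.g. 2022 01 11 18 22 46). Note that the second argument of mktime is only a flag: any non-zero or non-null value makes gawk read the time as UTC, so the offset itself is not applied to the result.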

This version attempts to determine whether daylight saving time is in effect for the specified time. Replace mktime($1, utc) by:

mktime($1 " 1", utc) if you want awk to assume daylight saving time,
mktime($1 " 0", utc) if you want awk to assume standard time.
