Faster drop-in replacement for bash `cut` for specific application-CodePudding

I have a very large tab separated file. The tab separated file is binary and will be streamed by the tool samtools (which is very fast and not the bottleneck). Now I want to output only the content up to the first tab.

In my current piped command cut is the bottleneck:

samtools view -@ 15 -F 0x100 file.bam | cut -f 1 | pigz > out.gz

I tried using awk '{print $1}'. This is not sufficiently faster I also tried using parallelin combination withcut` but this also does not increase the speed much.

I guess it would be better to have a tool which just outputs the string until first tab and then completely skips the entire line.

Do you have a suggestion for a tool which is faster for my purpose? Ideally, one would write a small C program I guess but my C is a bit rusty so would take too long for me.

CodePudding user response：

You are interested in a small C program that just outputs lines from stdin until first tab.

In C you can do this easily with something like this:

#include <stdio.h>
#include <string.h>

#define MAX_LINE_LENGTH 1024

int main(void) {
    char buf[MAX_LINE_LENGTH];

    while(fgets(buf, sizeof(buf), stdin) != NULL) {
        buf[strcspn(buf, "\n\t")] = '\0';
        fputs(buf, stdout);
        fputc('\n', stdout);
    }
    return 0;
}

It simply reads lines with fgets from stdin. The string is terminated with a NUL byte at the first tab \t. The same applies if there is a \n to have no extra line feeds in the output, just in case there is no tab on an input line.

Whether this is much faster in your use case I cannot say, but it should at least provide a starting point for trying out your idea.

CodePudding user response：

You might consider process-substitutions instead of a pipeline.

$ < <( < <(samtools view -@ 15 -F 0x100 file.bam) cut -f1 )  pigz

Note: I'm using process substitution to generate stdin and avoid using another FIFO. This seems to be much faster.

I've written a simple test script sam_test.sh that generates some output:

#!/usr/bin/env bash
echo {1..10000} | awk 'BEGIN{OFS="\t"}{$1=$1;for(i=1;i<=1000;  i) print i,$0}'

and compared the output of the following commands:

$ ./sam_test.sh | cut -f1 | awk '!(FNR%3)'
$ < <(./sam_test.sh) cut -f1 | awk '!(FNR%3)'
$ < <( < <(./sam_test.sh) cut -f1 ) awk '!(FNR%3)'

The later of the three cases is in 'runtime' significantly faster. Using strace -c , we can see that each pipeline adds a significant amount of wait4 syscalls. The final version is then also significantly faster (factor 700 in the above case).

Output of test case (short):

$ cat ./sam_test_full_pipe.sh
#!/usr/bin/env bash
./sam_test.sh | cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_full_pipe.sh > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.22    0.643249      160812         4         1 wait4
  0.30    0.001951           5       334       294 openat
  0.21    0.001331           5       266       230 stat
  0.04    0.000290          20        14        12 execve
<snip>
------ ----------- ----------- --------- --------- ----------------
100.00    0.648287         728       890       549 total

$ cat ./sam_test_one_pipe.sh
#!/usr/bin/env bash
< <(./sam_test.sh) cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_one_pipe.sh > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.72    0.256664       85554         3         1 wait4
  0.45    0.001181           3       334       294 openat
  0.29    0.000757           2       266       230 stat
<snip>
------ ----------- ----------- --------- --------- ----------------
100.00    0.259989         295       881       547 total

$ cat ./sam_test_no_pipe.sh
#!/usr/bin/env bash
< <(< <(./sam_test.sh) cut -f1 - ) awk '!(FNR%3)' -
$ strace -c ./sam_test_no_pipe.sh > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.43    0.002863        1431         2         1 wait4
 19.68    0.001429           4       334       294 openat
 14.87    0.001080           3       285       242 stat
 10.00    0.000726          51        14        12 execve
 <snip>
------ ----------- ----------- --------- --------- ----------------
100.00    0.007261           7       909       557 total

Output of test case (full):

$ cat ./sam_test_full_pipe.sh
#!/usr/bin/env bash
./sam_test.sh | cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_full_pipe.sh > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.22    0.643249      160812         4         1 wait4
  0.30    0.001951           5       334       294 openat
  0.21    0.001331           5       266       230 stat
  0.04    0.000290          20        14        12 execve
  0.04    0.000276           6        42           mmap
  0.04    0.000229          76         3           clone
  0.03    0.000178           3        49         4 close
  0.02    0.000146           3        39           fstat
  0.02    0.000109           9        12           mprotect
  0.01    0.000080           5        16           read
  0.01    0.000053           2        18           rt_sigprocmask
  0.01    0.000052           3        16           rt_sigaction
  0.01    0.000038           3        10           brk
  0.01    0.000036          18         2           munmap
  0.01    0.000034           5         6         2 access
  0.00    0.000029           3         8         1 fcntl
  0.00    0.000024           3         7           lseek
  0.00    0.000019           4         4         3 ioctl
  0.00    0.000019           9         2           pipe
  0.00    0.000018           3         5           getuid
  0.00    0.000018           3         5           getgid
  0.00    0.000018           3         5           getegid
  0.00    0.000017           3         5           geteuid
  0.00    0.000013           4         3           dup2
  0.00    0.000013          13         1           faccessat
  0.00    0.000009           2         4         2 arch_prctl
  0.00    0.000008           4         2           getpid
  0.00    0.000008           4         2           prlimit64
  0.00    0.000005           5         1           sysinfo
  0.00    0.000004           4         1           write
  0.00    0.000004           4         1           uname
  0.00    0.000004           4         1           getppid
  0.00    0.000003           3         1           getpgrp
  0.00    0.000002           2         1           rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00    0.648287         728       890       549 total

$ cat ./sam_test_one_pipe.sh
#!/usr/bin/env bash
< <(./sam_test.sh) cut -f1 - | awk '!(FNR%3)' -
$ strace -c ./sam_test_one_pipe.sh > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.72    0.256664       85554         3         1 wait4
  0.45    0.001181           3       334       294 openat
  0.29    0.000757           2       266       230 stat
  0.11    0.000281          20        14        12 execve
  0.08    0.000220           5        42           mmap
  0.06    0.000159          79         2           clone
  0.05    0.000138           3        45         2 close
  0.05    0.000125           3        39           fstat
  0.03    0.000083           6        12           mprotect
  0.02    0.000060           3        16           read
  0.02    0.000054           3        16           rt_sigaction
  0.02    0.000042           2        16           rt_sigprocmask
  0.01    0.000038           6         6         2 access
  0.01    0.000035          17         2           munmap
  0.01    0.000027           2        10           brk
  0.01    0.000019           3         5           getuid
  0.01    0.000018           3         5           geteuid
  0.01    0.000017           3         5           getgid
  0.01    0.000017           3         5           getegid
  0.00    0.000010           1         7           lseek
  0.00    0.000009           2         4         3 ioctl
  0.00    0.000008           4         2           getpid
  0.00    0.000007           1         4         2 arch_prctl
  0.00    0.000005           5         1           sysinfo
  0.00    0.000004           4         1           uname
  0.00    0.000003           3         1           getppid
  0.00    0.000003           3         1           getpgrp
  0.00    0.000003           1         2           prlimit64
  0.00    0.000002           2         1           rt_sigreturn
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           pipe
  0.00    0.000000           0         3           dup2
  0.00    0.000000           0         8         1 fcntl
  0.00    0.000000           0         1           faccessat
------ ----------- ----------- --------- --------- ----------------
100.00    0.259989         295       881       547 total

$ cat ./sam_test_no_pipe.sh
#!/usr/bin/env bash
< <(< <(./sam_test.sh) cut -f1 - ) awk '!(FNR%3)' -
$ strace -c ./sam_test_no_pipe.sh > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.43    0.002863        1431         2         1 wait4
 19.68    0.001429           4       334       294 openat
 14.87    0.001080           3       285       242 stat
 10.00    0.000726          51        14        12 execve
  2.67    0.000194           4        42           mmap
  1.83    0.000133           3        39           fstat
  1.67    0.000121         121         1           clone
  1.58    0.000115           2        41           close
  0.88    0.000064           6        10         2 access
  0.87    0.000063           5        12           mprotect
  0.73    0.000053           3        16           rt_sigaction
  0.70    0.000051           4        12           rt_sigprocmask
  0.66    0.000048           3        16           read
  0.48    0.000035           3        10           brk
  0.48    0.000035           3         9           getuid
  0.44    0.000032          16         2           munmap
  0.41    0.000030           3         8         1 fcntl
  0.41    0.000030           3         9           geteuid
  0.40    0.000029           3         9           getegid
  0.34    0.000025           2         9           getgid
  0.22    0.000016           5         3           dup2
  0.21    0.000015           3         4         3 ioctl
  0.19    0.000014           2         7           lseek
  0.18    0.000013          13         1           faccessat
  0.12    0.000009           2         4         2 arch_prctl
  0.11    0.000008           4         2           prlimit64
  0.08    0.000006           3         2           getpid
  0.06    0.000004           4         1           write
  0.06    0.000004           4         1           rt_sigreturn
  0.06    0.000004           4         1           uname
  0.06    0.000004           4         1           sysinfo
  0.06    0.000004           4         1           getppid
  0.06    0.000004           4         1           getpgrp
------ ----------- ----------- --------- --------- ----------------
100.00    0.007261           7       909       557 total

CodePudding user response：

GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)

You might give a try other implementation of AWK, according to test done in 2009¹ Don’t MAWK AWK – the fastest and most elegant big data munging language! nawk was found faster than gawk and mawk was found faster than nawk. You would need to run test with your data to find if using other implementation give noticeable boost.

¹so versions available in 2022 might give different result

CodePudding user response：

In the question OP has mentioned that awk '{print $1}' is not sufficiently faster than cut; in my testing I'm seeing awk running about twice as fast as cut, so not sure how OP is using awk ... or if I'm missing something (basic) with my testing ...

OP has mentioned a 'large' tab-delimited file with up to 400 characters per line; we'll simulate this with the following code that generates a ~400MB file:

$ cat sam_out.awk
awk '
BEGIN { OFS="\t"; x="1234567890"
        for (i=1;i<=40;i  )      filler=filler x
        for (i=1;i<=1000000;i  ) print x,filler
      }'
$ . ./sam_out.awk | wc
1000000 2000000 412000000

Test calls:

$ cat sam_tests.sh
echo "######### pipe to cut"
time . ./sam_out.awk | cut -f1 -         > /dev/null

echo "######### pipe to awk"
time . ./sam_out.awk | awk '{print $1}'  > /dev/null

echo "######### process-sub to cut"
time cut -f1          <(. ./sam_out.awk) > /dev/null

echo "######### process-sub to awk"
time awk '{print $1}' <(. ./sam_out.awk) > /dev/null

NOTE: also ran all 4 tests with output written to 4 distinct output files; diff of the 4 output files showed all were the same (wc: 1000000 1000000 11000000; head -1: 1234567890)

Results of running the tests:

######### pipe to cut

real    0m1.177s
user    0m0.205s
sys     0m1.454s

######### pipe to awk

real    0m0.582s
user    0m0.166s
sys     0m0.759s

######### process-sub to cut

real    0m1.265s
user    0m0.351s
sys     0m1.746s

######### process-sub to awk

real    0m0.655s
user    0m0.097s
sys     0m0.968s

NOTES:

test system: Ubuntu 10.04, cut (GNU coreutils 8.30), awk (GNU Awk 5.0.1)
earlier version of this answer showed awk running 14x-15x times faster than cut; that system: cygwin 3.3.5, cut (GNU coreutils 8.26), awk (GNU Awk 5.1.1)

CodePudding user response：

In the end I hacked a small C program which directly filters the BAM file and also writes to a gzip -- with a lot of help of the htslib developers (which is the basis for samtools).

So piping is not needed any more. This solution is about 3-4 times faster than the solution with the C code above (from Stephan).

See here: https://github.com/samtools/samtools/issues/1672