Home > Mobile >  Go line-by-line processing taking way too long
Go line-by-line processing taking way too long

Time:01-29

I'm processing some very big files, and my simple Go program to do this is taking 2 minutes to run instead of the 15 seconds it takes for the equivalent C program (https://gist.github.com/g2boojum/5729bf75a41f537b8251af25a816c2fc). Clearly I'm missing something important. (It's also my first Go program, so I'm sure the code is idiomatically poor, too.)

The files I'm processing are csv files, which look like the following, and the only issue is that they're GB in size.

BOARD;CHANNEL;TIMETAG;ENERGY;ENERGYSHORT;FLAGS
0;0;179096000;316;105;0x0
0;0;682168000;449;146;0x0
0;0;905440000;92;35;0x0

I could have used the csv module, but Scanf converts the fields to the correct types for me, and that seemed simpler. All the code does is skip the header line, and then read the rest of the file line-by-line to create a histogram of the 4th field, and then write the histogram at the end.

import (    
    "bufio"    
    "fmt"    
    "os"     
    "log"    
)    
     
const num_channels int = 4096    
    
func main () {    
    if len(os.Args) != 3 {    
        fmt.Printf("Usage: %s infile outfile\n", os.Args[0])    
        os.Exit(1)    
    }    
    fin, err := os.Open(os.Args[1])    
    if err != nil {    
        log.Fatal(err)    
    }    
    scanner := bufio.NewScanner(fin)               
    scanner.Scan() // skip the header line     
    fout, err := os.Create(os.Args[2])    
    if err != nil {    
        log.Fatal(err)    
    }    
    fmt.Fprintf(fout, "channel,total\n")    
    var total[num_channels] uint64    
    var tmax int64    
    var board, channel, energy, energyshort, flags int    
    var timetag int64    
    for scanner.Scan() {    
        fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)    
        total[energy]  = 1    
        tmax = timetag        
    }                                                                                    
    tmax_s := float64(tmax)/1e12    
    fmt.Println("tmax = ", tmax_s)    
    for i, val := range total {         
        fmt.Fprintf(fout, "%v,%v\n", i, float64(val)/tmax_s)    
    }    
}        

Help? Thanks!

[Update, and a solution, and some weirdness]

I simplified things, so I could get a better feeling for what is going on with even simpler code. I removed the header line of the csv file I was using to test everything, and I also created an even shorter version of the csv file that I could share, in case anybody wanted an test case (https://grantgoodyear.org/files/sample60.csv).

Here's a streamlined C code:

#include <stdio.h>

int main(int argc, char* argv[]) {
  FILE* fp = fopen(argv[1], "r");
  int board, channel, energy, energyshort, flags;
  long timetag;
  double tmax = 0;
  while ((fscanf(fp, "%d;%d;%ld;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)) != EOF) {
    tmax = timetag / 1.0e12;
  }
  printf("tmax: %f\n", tmax);
  fclose(fp);
}

It processes the short and the 1.5GB files in 0.16s and 15s, respectively.

$ time cmaxt sample60.csv 
tmax: 59.999983

real    0m0.160s
user    0m0.152s
sys 0m0.008s
$ time cmaxt long.csv 
tmax: 7200.265511

real    0m14.730s
user    0m14.451s
sys 0m0.255s

In comparison, here's a nearly identical program in go:

import (
    "io"
    "fmt"
    "os"
)


func main () {
    fin, _ := os.Open(os.Args[1])
    var tmax float64
    var board, channel, energy, energyshort, flags int
    var timetag int64
    for {
        _, err := fmt.Fscanf(fin,"%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
        if err == io.EOF {
            break
        }
        tmax = float64(timetag)/1e12
    }
    fmt.Println("tmax = ", tmax)
}

It takes an amazingly long time to run:

$ time gomaxt sample60.csv 
tmax =  59.999983344

real    0m8.044s
user    0m4.677s
sys 0m3.555s
$ time gomaxt long.csv 
tmax =  7200.265510652

real    18m37.472s
user    10m58.221s
sys 8m28.282s

I have no idea what's happening here, but it's taking 50-75 times longer than the C version. It's especially weird that the sys time is so long.

I changed the go version so that it more resembles my original post, using a bufio.NewScanner and fmt.Sscanf to do the splitting:

 import (
    "bufio"
    "fmt"
    "os"
)


func main () {
    fin, _ := os.Open(os.Args[1])
    scanner := bufio.NewScanner(fin)
    var tmax float64
    var timetag int64
    var board, channel, energy, energyshort, flags int
    for scanner.Scan() {
        fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
        tmax = float64(timetag)/1e12
    }
    fmt.Println("tmax = ", tmax)
}

This version only(?!) takes 6 times longer than the C version:

$ time gomaxtorig sample60.csv 
tmax =  59.999983344

real    0m0.895s
user    0m0.905s
sys 0m0.038s
$ time gomaxtorig long.csv 
tmax =  7200.265510652

real    1m53.030s
user    2m1.039s
sys 0m3.021s

So now let's replace fmt.Sscanf with string splitting:

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)


func main () {
    fin, _ := os.Open(os.Args[1])
    scanner := bufio.NewScanner(fin)
    var tmax float64
    var timetag int64
    for scanner.Scan() {
        ss := strings.Split(scanner.Text(), ";")
        timetag, _ = strconv.ParseInt(ss[2], 10, 64)
        tmax = float64(timetag)/1e12
    }
    fmt.Println("tmax = ", tmax)
}

As suggested, the majority of time actually was being spent in fmt.Sscanf. This version takes slightly twice as long as the C version:

$ time ../gammaprocess_go/maxtsplit sample60.csv 
tmax =  59.999983344

real    0m0.226s
user    0m0.243s
sys 0m0.022s
$ time ../gammaprocess_go/maxtsplit long.csv 
tmax =  7200.265510652

real    0m26.434s
user    0m28.834s
sys 0m1.683s

I did write a slightly hacky version that forced the string conversion of the other fields in each row of the csv file, just to see if that made any difference, and the times were essentially identical to the above version.

Because I'm using Scanner.Text(), there is a lot of strings created and then thrown out, and it was suggested that I use bytes instead of strings. It looks to me like that's what the csv package does, so I just used it:

import (
    "encoding/csv"
    "fmt"
    "os"
    "io"
    "strconv"
)


func main () {
    fin, _ := os.Open(os.Args[1])
    r := csv.NewReader(fin)
    r.Comma = ';'

    var tmax float64
    var timetag int64
    for {
        rec, err := r.Read()
        if err == io.EOF {
            break
        }
        timetag, _ = strconv.ParseInt(rec[2], 10, 64)
        tmax = float64(timetag)/1e12
    }
    fmt.Println("tmax = ", tmax)
}

The timing is just slightly longer than the string-splitting version with Scanner.Text*():

$ time gomaxtcsv sample60.csv 
tmax =  59.999983344

real    0m0.281s
user    0m0.300s
sys 0m0.019s
$ time gomaxtcsv long.csv 
tmax =  7200.265510652

real    0m32.736s
user    0m35.619s
sys 0m1.702s

but presumably there's more overhead since the csv package is doing a lot more than my simple example, so I'd say that's inconclusive.

Regardless, I'm fine with a factor of 2 inefficiency, so I'm not going to try to keep optimizing this. Many thanks to the folks who answered this question.

[Another update]

Or just look at https://groups.google.com/g/golang-nuts/c/W08rFBcHKbc/m/oIrdQcBxKa4J, from 2014.

CodePudding user response:

sscanf takes most of the time. Do:

        ss := strings.Split(scanner.Text(), ";")
        board, _ = strconv.Atoi(ss[0])
        channel, _ = strconv.Atoi(ss[1])
        timetag, _ = strconv.Atoi(ss[2])
        energy, _ = strconv.Atoi(ss[3])
        flags, _ = strconv.ParseUint(ss[4], 16, 64)

Checking errors omitted.

  •  Tags:  
  • Related