I'm processing some very big files, and my simple Go program to do this is taking 2 minutes to run instead of the 15 seconds it takes for the equivalent C program (https://gist.github.com/g2boojum/5729bf75a41f537b8251af25a816c2fc). Clearly I'm missing something important. (It's also my first Go program, so I'm sure the code is idiomatically poor, too.)
The files I'm processing are csv files, which look like the following, and the only issue is that they're GB in size.
BOARD;CHANNEL;TIMETAG;ENERGY;ENERGYSHORT;FLAGS
0;0;179096000;316;105;0x0
0;0;682168000;449;146;0x0
0;0;905440000;92;35;0x0
I could have used the csv module, but Scanf converts the fields to the correct types for me, and that seemed simpler. All the code does is skip the header line, and then read the rest of the file line-by-line to create a histogram of the 4th field, and then write the histogram at the end.
import (
"bufio"
"fmt"
"os"
"log"
)
const num_channels int = 4096
func main () {
if len(os.Args) != 3 {
fmt.Printf("Usage: %s infile outfile\n", os.Args[0])
os.Exit(1)
}
fin, err := os.Open(os.Args[1])
if err != nil {
log.Fatal(err)
}
scanner := bufio.NewScanner(fin)
scanner.Scan() // skip the header line
fout, err := os.Create(os.Args[2])
if err != nil {
log.Fatal(err)
}
fmt.Fprintf(fout, "channel,total\n")
var total[num_channels] uint64
var tmax int64
var board, channel, energy, energyshort, flags int
var timetag int64
for scanner.Scan() {
fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
total[energy] = 1
tmax = timetag
}
tmax_s := float64(tmax)/1e12
fmt.Println("tmax = ", tmax_s)
for i, val := range total {
fmt.Fprintf(fout, "%v,%v\n", i, float64(val)/tmax_s)
}
}
Help? Thanks!
[Update, and a solution, and some weirdness]
I simplified things, so I could get a better feeling for what is going on with even simpler code. I removed the header line of the csv file I was using to test everything, and I also created an even shorter version of the csv file that I could share, in case anybody wanted an test case (https://grantgoodyear.org/files/sample60.csv).
Here's a streamlined C code:
#include <stdio.h>
int main(int argc, char* argv[]) {
FILE* fp = fopen(argv[1], "r");
int board, channel, energy, energyshort, flags;
long timetag;
double tmax = 0;
while ((fscanf(fp, "%d;%d;%ld;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)) != EOF) {
tmax = timetag / 1.0e12;
}
printf("tmax: %f\n", tmax);
fclose(fp);
}
It processes the short and the 1.5GB files in 0.16s and 15s, respectively.
$ time cmaxt sample60.csv
tmax: 59.999983
real 0m0.160s
user 0m0.152s
sys 0m0.008s
$ time cmaxt long.csv
tmax: 7200.265511
real 0m14.730s
user 0m14.451s
sys 0m0.255s
In comparison, here's a nearly identical program in go:
import (
"io"
"fmt"
"os"
)
func main () {
fin, _ := os.Open(os.Args[1])
var tmax float64
var board, channel, energy, energyshort, flags int
var timetag int64
for {
_, err := fmt.Fscanf(fin,"%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
if err == io.EOF {
break
}
tmax = float64(timetag)/1e12
}
fmt.Println("tmax = ", tmax)
}
It takes an amazingly long time to run:
$ time gomaxt sample60.csv
tmax = 59.999983344
real 0m8.044s
user 0m4.677s
sys 0m3.555s
$ time gomaxt long.csv
tmax = 7200.265510652
real 18m37.472s
user 10m58.221s
sys 8m28.282s
I have no idea what's happening here, but it's taking 50-75 times longer than the C version. It's especially weird that the sys time is so long.
I changed the go version so that it more resembles my original post, using a bufio.NewScanner and fmt.Sscanf to do the splitting:
import (
"bufio"
"fmt"
"os"
)
func main () {
fin, _ := os.Open(os.Args[1])
scanner := bufio.NewScanner(fin)
var tmax float64
var timetag int64
var board, channel, energy, energyshort, flags int
for scanner.Scan() {
fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
tmax = float64(timetag)/1e12
}
fmt.Println("tmax = ", tmax)
}
This version only(?!) takes 6 times longer than the C version:
$ time gomaxtorig sample60.csv
tmax = 59.999983344
real 0m0.895s
user 0m0.905s
sys 0m0.038s
$ time gomaxtorig long.csv
tmax = 7200.265510652
real 1m53.030s
user 2m1.039s
sys 0m3.021s
So now let's replace fmt.Sscanf with string splitting:
import (
"bufio"
"fmt"
"os"
"strconv"
"strings"
)
func main () {
fin, _ := os.Open(os.Args[1])
scanner := bufio.NewScanner(fin)
var tmax float64
var timetag int64
for scanner.Scan() {
ss := strings.Split(scanner.Text(), ";")
timetag, _ = strconv.ParseInt(ss[2], 10, 64)
tmax = float64(timetag)/1e12
}
fmt.Println("tmax = ", tmax)
}
As suggested, the majority of time actually was being spent in fmt.Sscanf. This version takes slightly twice as long as the C version:
$ time ../gammaprocess_go/maxtsplit sample60.csv
tmax = 59.999983344
real 0m0.226s
user 0m0.243s
sys 0m0.022s
$ time ../gammaprocess_go/maxtsplit long.csv
tmax = 7200.265510652
real 0m26.434s
user 0m28.834s
sys 0m1.683s
I did write a slightly hacky version that forced the string conversion of the other fields in each row of the csv file, just to see if that made any difference, and the times were essentially identical to the above version.
Because I'm using Scanner.Text(), there is a lot of strings created and then thrown out, and it was suggested that I use bytes instead of strings. It looks to me like that's what the csv package does, so I just used it:
import (
"encoding/csv"
"fmt"
"os"
"io"
"strconv"
)
func main () {
fin, _ := os.Open(os.Args[1])
r := csv.NewReader(fin)
r.Comma = ';'
var tmax float64
var timetag int64
for {
rec, err := r.Read()
if err == io.EOF {
break
}
timetag, _ = strconv.ParseInt(rec[2], 10, 64)
tmax = float64(timetag)/1e12
}
fmt.Println("tmax = ", tmax)
}
The timing is just slightly longer than the string-splitting version with Scanner.Text*():
$ time gomaxtcsv sample60.csv
tmax = 59.999983344
real 0m0.281s
user 0m0.300s
sys 0m0.019s
$ time gomaxtcsv long.csv
tmax = 7200.265510652
real 0m32.736s
user 0m35.619s
sys 0m1.702s
but presumably there's more overhead since the csv package is doing a lot more than my simple example, so I'd say that's inconclusive.
Regardless, I'm fine with a factor of 2 inefficiency, so I'm not going to try to keep optimizing this. Many thanks to the folks who answered this question.
[Another update]
Or just look at https://groups.google.com/g/golang-nuts/c/W08rFBcHKbc/m/oIrdQcBxKa4J, from 2014.
CodePudding user response:
sscanf takes most of the time. Do:
ss := strings.Split(scanner.Text(), ";")
board, _ = strconv.Atoi(ss[0])
channel, _ = strconv.Atoi(ss[1])
timetag, _ = strconv.Atoi(ss[2])
energy, _ = strconv.Atoi(ss[3])
flags, _ = strconv.ParseUint(ss[4], 16, 64)
Checking errors omitted.
