How can I improve the speed of my large txt processing script?

I have a program that scans a very large text file (a .pts file, actually) that looks like this:

437288479
-6.9465 -20.49 -1.3345 70
-6.6835 -20.82 -1.3335 83
-7.3105 -20.179 -1.3325 77
-7.1005 -20.846 -1.3295 96
-7.3645 -20.759 -1.2585 79
...

The first line is the number of points contained in the file, and every other line corresponds to a {x,y,z,intensity} point in a 3D space. This file above is ~11 GB but I have more files to process that can be up to ~50 GB.

Here's the code I use to read this file :

#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <tuple>
#include <cmath>
#include <cstdio>   // sscanf
#include <string>   // std::string, std::getline

// boost library
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/stream.hpp>

struct point
{
    double x;
    double y;
    double z;
};


void readMappedFile()
{
    boost::iostreams::mapped_file_source mmap("my_big_file.pts");
    boost::iostreams::stream<boost::iostreams::mapped_file_source> is(mmap, std::ios::binary);
    std::string line;

    // get rid of the first line
    std::getline(is, line);
    
    while (std::getline(is, line))
    {
        point p;
        sscanf(line.c_str(),"%lf %lf %lf %*d", &(p.x), &(p.y), &(p.z));
        if (p.z > minThreshold && p.z < maxThreshold)
        {
            // do something with p and store it in the vector of tuples
            // O(n) complexity
        }
    }
}

int main ()
{
    readMappedFile();
    return 0;
}

For my 11 GB file, scanning all the lines and storing the data in point p takes ~13 minutes. Is there a way to make it much faster? Each time I scan a point, I also have to do some processing on it, which will make the program take several hours in the end.

I started looking into using several cores but it seems it could be problematic if some points are linked together for some reason. If you have any advice on how you would proceed, I'll gladly hear about it.

Edit1: I'm running the program on a laptop with an 8-core 2.9 GHz CPU, 16 GB of RAM, and an SSD. The program has to run on similar hardware.

Edit2 : Here's the complete program so you can tell me what I've been doing wrong. I localize each point in a sort of 2D grid called slab. Each cell will contain a certain amount of points and a z mean value.

#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <tuple>
#include <cmath>
#include <cstdio>   // sscanf
#include <string>   // std::string, std::getline

// boost library
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/stream.hpp>

struct point
{
    double x;
    double y;
    double z;
};

/*
    Compute Slab
*/

float slabBox[6] = {-25.,25.,-25.,25.,-1.,0.};
float dx = 0.1;
float dy = 0.1;
int slabSizeX = (slabBox[1] - slabBox[0]) / dx;
int slabSizeY = (slabBox[3] - slabBox[2]) / dy;

std::vector<std::tuple<double, double, double, int>> initSlab() 
{
    // initialize the slab vector according to the grid size
    std::vector<std::tuple<double, double, double, int>> slabVector(slabSizeX * slabSizeY, {0., 0., 0., 0});

    // fill the vector with {x,y} cells coordinates
    for (int y = 0; y < slabSizeY; y++)
    {
        for (int x = 0; x < slabSizeX; x++)
        {
            slabVector[x + y * slabSizeX] = {x * dx + slabBox[0], y * dy + slabBox[2], 0., 0};
        }
    }
    return slabVector;
}

std::vector<std::tuple<double, double, double, int>> addPoint2Slab(point p, std::vector<std::tuple<double, double, double, int>> slabVector)
{
    // find the region {x,y} in the slab in which coord {p.x,p.y} is
    int x = (int) floor((p.x - slabBox[0])/dx);
    int y = (int) floor((p.y - slabBox[2])/dy);
    
    // calculate the new z value
    double z = (std::get<2>(slabVector[x + y * slabSizeX]) * std::get<3>(slabVector[x + y * slabSizeX]) + p.z) / (std::get<3>(slabVector[x + y * slabSizeX]) + 1);

    // replace the older z
    std::get<2>(slabVector[x + y * slabSizeX]) = z;

    // add +1 point in the cell
    std::get<3>(slabVector[x + y * slabSizeX])++;
    return slabVector;
}

/*
    Parse the file 
*/

void readMappedFile()
{
    boost::iostreams::mapped_file_source mmap("T032_OSE.pts");
    boost::iostreams::stream<boost::iostreams::mapped_file_source> is(mmap, std::ios::binary);

    std::string line;
    std::getline(is, line);

    auto slabVector = initSlab();
    
    while (std::getline(is, line))
    {
        point p;
        sscanf(line.c_str(),"%lf %lf %lf %*d", &(p.x), &(p.y), &(p.z));
        if (p.z > slabBox[4] && p.z < slabBox[5])
        {
            slabVector = addPoint2Slab(p, slabVector);
        }
    }
}

int main ()
{
    readMappedFile();
    return 0;
}

User response:

If the file is stored on an HDD, just reading 11 GB at ~100 MB/s takes about 2 minutes, and that is the best case. Try reading the file in blocks and processing each block in another thread while the next block is being read.
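A minimal sketch of that overlap, assuming C++11 and std::async. The names readBlock and processFileOverlapped are made up for illustration, and the process callback stands in for whatever parsing you do. It reads the next block on a background thread while the current one is being processed:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Reads up to blockSize bytes from the stream; returns an empty vector at end of file.
static std::vector<char> readBlock(std::ifstream& in, std::size_t blockSize) {
    std::vector<char> block(blockSize);
    in.read(block.data(), static_cast<std::streamsize>(block.size()));
    block.resize(static_cast<std::size_t>(in.gcount()));
    return block;
}

// Calls process(block) for each block while the next block is read in the background.
// Returns the total number of bytes handed to process (0 if the file cannot be opened).
std::size_t processFileOverlapped(const std::string& path, std::size_t blockSize,
                                  const std::function<void(const std::vector<char>&)>& process) {
    assert(blockSize > 0);
    std::ifstream in(path, std::ios::binary);
    if (!in) return 0;

    std::size_t total = 0;
    std::vector<char> current = readBlock(in, blockSize);
    while (!current.empty()) {
        // Start reading the next block on another thread...
        std::future<std::vector<char>> next =
            std::async(std::launch::async, readBlock, std::ref(in), blockSize);
        // ...while this thread works on the block that is already in memory.
        process(current);
        total += current.size();
        current = next.get();  // wait for the background read to finish
    }
    return total;
}
```

Blocks are cut at arbitrary byte offsets here, so a real parser still has to deal with lines that straddle two blocks. On some toolchains you may need to link with the thread library (e.g. -pthread). Whether this helps depends on whether the job is I/O-bound or CPU-bound, so measure first.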

Also, you have something like:

std::vector<...> addPoint2Slab(point, std::vector<...> result)
{
    ...
    return result;
}

slabVector = addPoint2Slab(point, slabVector);

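Because slabVector is passed by value, I suppose the whole vector (roughly 250,000 cells with the grid in the question) is copied on every call, i.e. once per accepted point. The returned vector can be moved, but the copy into the parameter generally cannot be optimized away, since the caller is still using slabVector. Try passing the vector by reference and measure the speed:

void addPoint2Slab(point, std::vector<...> & result);

And call:

addPoint2Slab(point, slabVector);

The function then updates the slab in place, so nothing needs to be returned.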
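To make the by-reference version concrete, here is a sketch under a few assumptions: Cell is a hypothetical struct used instead of the std::tuple, the grid constants are copied from the question's slabBox and dx/dy, and the bounds check is an addition that the original code does not have.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical cell type replacing the std::tuple from the question.
struct Cell {
    double zMean = 0.0;  // running mean of z for the points in this cell
    int count = 0;       // number of points accumulated so far
};

// Grid geometry taken from the question: [-25, 25] x [-25, 25] with 0.1 cell size.
constexpr double kMinX = -25.0, kMinY = -25.0, kCellSize = 0.1;
constexpr int kSizeX = 500, kSizeY = 500;

// Updates the slab in place; the vector is passed by reference, so it is never copied.
// Returns false when the point falls outside the grid.
bool addPointToSlab(double px, double py, double pz, std::vector<Cell>& slab) {
    assert(slab.size() == static_cast<std::size_t>(kSizeX) * kSizeY);
    const int x = static_cast<int>(std::floor((px - kMinX) / kCellSize));
    const int y = static_cast<int>(std::floor((py - kMinY) / kCellSize));
    if (x < 0 || x >= kSizeX || y < 0 || y >= kSizeY) return false;

    Cell& cell = slab[static_cast<std::size_t>(x) + static_cast<std::size_t>(y) * kSizeX];
    // Incremental mean: new = (old * n + z) / (n + 1)
    cell.zMean = (cell.zMean * cell.count + pz) / (cell.count + 1);
    ++cell.count;
    return true;
}
```

Taking a reference to the cell once also avoids recomputing the index several times, which keeps the update easier to read.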

User response:

Get rid of std::getline. iostreams are pretty slow compared to direct in-memory processing of strings. Also, do not use sscanf.

Allocate a large chunk of memory, e.g. 128 MB or more. Fill it from the file in one call. Then parse this chunk until you reach its end, and repeat.

Sort of like this:

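std::vector<char> huge_chunk(128*1024*1024);
std::ifstream in("my_file", std::ios::binary);
do {
   in.read(huge_chunk.data(), huge_chunk.size());
   // parse() is a placeholder for your own parsing function;
   // gcount() is the number of bytes actually read
   parse(huge_chunk.data(), in.gcount());
} while (in.good());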

You get the idea.

Parse the chunk with strtof, find and the like.

Parsing the chunk will leave a few characters at the end of the chunk which do not form a complete line. You need to store them temporarily and resume parsing the next chunk from there.
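One way to handle that leftover, as a rough sketch: forEachLine is a made-up helper, not a library function, and it assumes the chunk buffer is writable so each '\n' can be overwritten with '\0'. Complete lines are passed to a callback; the incomplete tail is kept in carry and finished when the next chunk arrives.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>

// Calls onLine(line) for every complete line in the chunk; line is NUL-terminated.
// The chunk is modified in place: each '\n' is overwritten with '\0'.
// An incomplete trailing line is saved in carry and completed by the next call.
void forEachLine(char* data, std::size_t size, std::string& carry,
                 const std::function<void(const char*)>& onLine) {
    assert(data != nullptr || size == 0);
    char* cursor = data;
    char* const chunkEnd = data + size;
    while (cursor < chunkEnd) {
        void* hit = std::memchr(cursor, '\n', static_cast<std::size_t>(chunkEnd - cursor));
        if (hit == nullptr) {
            carry.append(cursor, chunkEnd);  // no newline left: keep the tail for later
            return;
        }
        char* newline = static_cast<char*>(hit);
        *newline = '\0';
        if (carry.empty()) {
            onLine(cursor);  // common case: the line lies entirely inside this chunk
        } else {
            carry.append(cursor, newline);  // finish the line started in the previous chunk
            onLine(carry.c_str());
            carry.clear();
        }
        cursor = newline + 1;
    }
}
```

After the last chunk, whatever is still in carry is a final line without a trailing newline and has to be handled by the caller. In a hot loop, a template parameter instead of std::function would likely avoid some call overhead.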

Generally speaking: the fewer calls to ifstream, the better. And using lower-level functions such as strtof, strtoul etc. is usually faster than sscanf, format etc.
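For illustration, the sscanf call from the question could be replaced by something like the sketch below. parsePoint and Point are hypothetical names, and it uses std::strtod rather than strtof because the question stores doubles. The intensity column is simply not read.

```cpp
#include <cassert>
#include <cstdlib>

struct Point {
    double x, y, z;
};

// Parses "x y z [intensity]" from a NUL-terminated line using std::strtod.
// Returns false if fewer than three numbers are found (e.g. the header line
// that only holds the point count).
bool parsePoint(const char* line, Point& p) {
    assert(line != nullptr);
    double* fields[3] = {&p.x, &p.y, &p.z};
    for (double* field : fields) {
        char* end = nullptr;
        *field = std::strtod(line, &end);  // skips leading whitespace itself
        if (end == line) return false;     // no conversion was performed
        line = end;                        // continue right after the parsed number
    }
    return true;
}
```

Keep in mind that std::strtod uses the current C locale for the decimal separator, so this assumes the default "C" locale. How much faster this is than sscanf depends on the standard library, so it is worth timing both on your data.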

This usually does not matter for small files <1MB, but can make a huge difference with very large files.

Also: use a profiler to find out exactly where your program is spending its time. Intel's VTune profiler is free, afaik. It is part of the oneAPI toolkit and is one of the best tools I know.