What is a fast way to read a matrix from a CSV file to NumPy if the size is known in advance?

Time:02-04

I was tired of waiting while loading a simple distance matrix from a CSV file using numpy.genfromtxt, so I ran a perfplot benchmark of several loading methods.

For the largest input size, the fastest method was read_csv (which, despite its name, uses Python's csv.reader rather than pandas):

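import csv

import numpy as np


def load_read_csv(path: str):
    # newline='' is what the csv module recommends when opening files.
    with open(path, 'r', newline='') as csv_file:
        reader = csv.reader(csv_file)
        matrix = None
        for row_index, row in enumerate(reader):
            if matrix is None:
                # The matrix is square, so the first row gives its size.
                matrix = np.zeros((len(row), len(row)), dtype=int)
            # NumPy converts the row's strings to int on assignment.
            matrix[row_index] = row

    return matrix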

Now I doubt that reading the file line by line, converting each line to a list of strings, calling int() on each item, and storing the result in a NumPy matrix is the best possible approach.
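One direction that might be worth benchmarking, since the size is known in advance, is to read the whole file in one call and let NumPy fill a preallocated buffer. The following is only a sketch under assumptions, not a measured improvement: the names parse_known_size and load_known_size are hypothetical, the file is assumed to hold exactly size * size comma-separated integers, and whether it actually beats csv.reader has not been tested here.

```python
import numpy as np


def parse_known_size(text: str, size: int) -> np.ndarray:
    """Parse comma-separated integers into a (size, size) matrix.

    Hypothetical helper: assumes the text holds at least size * size integers.
    """
    # Treat line breaks as separators so the file becomes one flat token stream.
    tokens = text.replace("\n", ",").split(",")
    # Skip empty tokens, such as the one left behind by a trailing newline.
    values = (int(token) for token in tokens if token.strip())
    # count lets NumPy allocate the output once instead of growing it.
    flat = np.fromiter(values, dtype=np.int64, count=size * size)
    return flat.reshape(size, size)


def load_known_size(path: str, size: int) -> np.ndarray:
    """Read the whole file at once, then parse it (hypothetical helper)."""
    with open(path, "r") as csv_file:
        return parse_known_size(csv_file.read(), size)
```

np.fromiter raises a ValueError if the file contains fewer than size * size values, which doubles as a cheap sanity check on the expected size.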

Can this function be optimized further, or is there some fast library for CSV loading (like …)?

[Figure: benchmark plot]

As you can see, the Numba implementation is at least one order of magnitude faster than all the others. Note that you could write even faster code by using multiple threads during decoding, but this would make the code significantly more complex.
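The Numba code referred to here is not reproduced in this text. Purely as an illustration of the general idea (decoding the raw bytes of the file in a compiled loop instead of going through Python strings), here is a minimal single-threaded sketch. It is not the answerer's implementation: the names decode_square_matrix and load_bytes are hypothetical, the input is assumed to contain exactly size * size plain ASCII integers separated by commas and line breaks, and the code falls back to plain (slow) Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs, without any speed-up, when Numba is missing.
    def njit(func):
        return func


@njit
def decode_square_matrix(raw, size):
    """Decode ASCII integers from a uint8 array into a (size, size) matrix.

    Assumes exactly size * size values; extra values are not guarded against.
    """
    out = np.zeros(size * size, dtype=np.int64)
    value = 0
    sign = 1
    in_number = False
    index = 0
    for i in range(raw.size):
        byte = np.int64(raw[i])
        if 48 <= byte <= 57:      # ASCII '0'..'9'
            value = value * 10 + (byte - 48)
            in_number = True
        elif byte == 45:          # ASCII '-'
            sign = -1
        elif in_number:           # any other byte ends the current number
            out[index] = sign * value
            index += 1
            value = 0
            sign = 1
            in_number = False
    if in_number:                 # last number when there is no trailing newline
        out[index] = sign * value
    return out.reshape(size, size)


def load_bytes(path, size):
    # Read the file as raw bytes in one call, then decode them in one pass.
    return decode_square_matrix(np.fromfile(path, dtype=np.uint8), size)
```

A multithreaded variant would have to split the byte buffer at line boundaries and decode the chunks in parallel, which is where the extra complexity mentioned above comes from.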
