Performance of numba (did I do something wrong?)-CodePudding

I am currently writing an algorithm which traverses a graph in python. This graph is connected to an underlying equation system and during the traversal, I have to extract and store some indices. I implemented it using networkx at first, but because the equation systems and also the connected graphs get quite big, the algorithm was too slow.

I then switched to a pure numpy implementation. This was faster, but still not fast enough. I thought that numba would be even faster, but it seems to be even slower. After I measured the computation time, I noticed that the main problem arises during the call of the following function:

@jit(nopython=True)
def add_col_to_mad_schedule_numba(rows, cols, data,  mad_schedule_col, edges, col, total_fillin_rows, total_fillin_cols,
                            total_fillin_data, nodes):

    colidx = nbfunc.where_single(cols, col)
    for k in range(len(rows[colidx])):
        if nbfunc.contained(nodes, rows[colidx][k]):
            edge_idx = nbfunc.where_single(edges[:, 0], rows[colidx][k])
            if edges[edge_idx].size != 0:
                kstart = edges[edge_idx, 0]
                ending = False
                while ending == False:
                    edges, k_filter = det_edges(kstart, edges)
                    k_filter = np.array(k_filter)
                    if k_filter.size == 0:
                        ending=True
                    else:
                        rows, cols, data, total_fillin_rows, total_fillin_cols, total_fillin_data = det_fillin(rows, cols, data, total_fillin_rows, total_fillin_cols, total_fillin_data, col, k_filter, edges)
                        first_idx, sec_idx, third_idx = mad_numba(rows, cols, edges, k_filter, col)

                        newl = np.zeros((len(edges[k_filter, 1]), 6), dtype=np.int64)
                        newl[:, 0] = edges[k_filter, 0]
                        newl[:, 1] = edges[k_filter, 1]
                        newl[:, 2] = first_idx
                        newl[:, 3] = sec_idx
                        newl[:, 4] = third_idx
                        newl[:, 5] = col


                        mad_schedule_col = np.append(mad_schedule_col, newl, axis=0)

                        kstart = edges[k_filter, 1]


    return rows, cols, data, total_fillin_rows, total_fillin_cols, total_fillin_data, mad_schedule_col[1:]

This function gets called n times, where n is the number of variables in the equation system. Each run of the function currently takes 61 ms, and I would like to ask if you can see any kind of technical bottleneck which arises due to a wrong usage of numba. For example, I am still creating numpy arrays in the function body. Might something like this lead to a bad performance?

The algorithm is quite time consuming indeed, because for every non zero entry in each column of the system (variable k), the directed graph is traversed until there is no successor left. The numbers of traversals is not that high. There are ~3 iterations in the while loop. For every column, there are also only 3-5 nonzero entries.

I can also provide the contents of det_fillin() and mad_numba(), but there do not happen a lot of things, I think. I retrieve some indices using my own numba equivalents of the numpy where() function.

Please note that also the nbfunc functions represent equivalents to the numpy functions. Where_single() corresponds to np.where and contained() simply checks if rows[colidx][k] is in nodes. All functions are compiled with @jit(nopython=True) and there are no error messages.

CodePudding user response：

The problem appears mainly in the np.append call. It create a new bigger array and copy the previous content for each call which significantly increase the complexity of the algorithm (an algorithm with a linear complexity can become quadratic in the worst case). The same thing applies with Numpy.

One solution to fix this problem is to use a list so to append many Numpy arrays in it and then concatenate all of them in a new bigger Numpy array.

Another solution is to directly create a big array with the good size and then fill it by copying the sub-arrays to the big one. This solution is much faster but it requires to know the size of the final array ahead of time. It is especially fast when the code can be executed in parallel.

Yet another solution is to use the previous one with two minor changes: the big array is created with the biggest possible size and only the actually-filled subset of it (a view) is returned by the computing function. This solution requires the size of the big array to be bounded and the bound not to be much bigger than the actual average size of the returned view. This is generally faster than using lists but it takes more memory.

Notes & Remarks

Be careful: please note that mad_schedule_col = ... does not writes-to/modify the array passed in parameter, it creates a new array and the variable mad_schedule_col is set so to reference the newly-created array. If you want to mutate the input array, then you need to write into it. If you do not know the size ahead of time, then the best is simply to return the modified mad_schedule_col.

Note also that the Numba functions are compiled lazily (when the function is executed for the first time) if no signature is provided and the compilation time can be quite slow. Providing a signature causes Numba to compile a function eagerly (when the function is defined).

Note that if newl.shape[0] if big, then your algorithm could be memory-bound. If so, then using Numba will not be much faster because the memory will be already saturated.

CodePudding user response：

This is only partially an answer, as I updated the function according to your advice. You were right about the np.append functions. I replaced them with large arrays or lists where possible and the time percall shrank to 25 ms. This is still a lot of time, I think. Additionally, adding the signatures increased the computation time a little, so this did not help unfortunately.

I noticed however something strange: I prepared the whole code for you, along with the necessary input variables and stored them using np.save, so you can take a closer look at it.

I also added a small test script, which simply runs the function add_col_to_mad_schedule_numba in a for loop, so that the profiler puts it on top of the list. After running this test script using the stored input, I noticed that in the test script, the function has a per call time of only 2 ms, compared to the 25 ms of the computation where the function is embedded in a larger script. How is this possible?

There still seems to be something wrong. What I try to build is a parallel sparse solver for linear equation systems. Nothing new, scipy has SuperLU implemented, which also uses graph traversals at some stage (but works differently otherwise). However, the same equation system solved in SuperLU takes in the whole only a few ms or so. I know that it will probably not be possible to achieve these times using numba, but there have to be options to further decrease the time for this graph traversal.

If you have a look at the code and want to run the test script, note that:

You have to modify the file paths in the with open commands in the main section.
If you have a look at the numba functions, you will note that there are still some commands where I use np.append. This is because other ways lead to strange error messages. I will take care of this later.

The code, together with the inputs, can be downloaded via this link:

https://drive.google.com/drive/folders/1hBvFmmZtJ2rRf5VaQFJX8pjCQtLSD9-0?usp=sharing