Optimizing a matrix transpose function with OpenMP-CodePudding

I have this code that transposes a matrix with loop tiling functionality.

void transposer(int n, int m, double *dst, const double *src) {
   int blocksize;
   for (int i = 0; i < n; i  = blocksize) {
      for (int j = 0; j < m; j  = blocksize) {
          // transpose the block beginning at [i,j]
          for (int k = i; k < i   blocksize;   k) {
              for (int l = j; l < j   blocksize;   l) {
                  dst[k   l*n] = src[l   k*m];
               }
          }
      }
   }
}

I want to optimize this with multithreading using OpenMP, however im not sure what to do when having so many nested for loops. I thought about just adding #pragma omp parallel for but doesnt this just parallelize the outer loop?

CodePudding user response：

You can use the collapse specifier to parallelize over two loops.

#   pragma omp parallel for collapse(2) 
    for (int i = 0; i < n; i  = blocksize) {
        for (int j = 0; j < m; j  = blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i   blocksize;   k) {
                for (int l = j; l < j   blocksize;   l) {
                    dst[k   l*n] = src[l   k*m];
                }
            }
        }
    }

As a side-note, I think you should swap the two innermost loops. Usually, when you have a choice between writing sequentially and reading sequentially, writing is more important for performance.

CodePudding user response：

When you try to parallelize a loop nest, you should ask yourself how many levels are conflict free. As in: every iteration writing to a different location. If two iterations write (potentially) to the same location, you need to 1. use a reduction 2. use a critical section or other synchronization 3. decide that this loop is not worth parallelizing, or 4. rewrite your algorithm.

In your case, the write location depends on k,l. Since k<n and l*n, there are no pairs k.l / k',l' that write to the same location. Furthermore, there are no two inner iterations that have the same k or l value. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).

You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.

CodePudding user response：

I thought about just adding #pragma omp parallel for but doesnt this just parallelize the outer loop?

Yes. To parallelize multiple consecutive loops one can utilize OpenMP' collapse clause. Bear in mind, however that:

(As pointed out by Victor Eijkhout). Even though this does not directly apply to your code snippet, typically, for each new loop to be parallelized one should reason about potential newer race-conditions e.g., that this parallelization might have added. For example, different threads writing concurrently into the same dst position.
in some cases parallelizing nested loops may result in slower execution times than parallelizing a single loop. Since, the concrete implementation of the collapse clause uses a more complex heuristic (than the simple loop parallelization) to divide the iterations of the loops among threads, which can result in an overhead higher than the gains that it provides.

You should try to benchmark with a single parallel loop and then with two, and so on, and compare the results, accordingly.

void transposer(int n, int m, double *dst, const double *src) {
    int blocksize;
    #pragma omp parallel for collapse(...)
    for (int i = 0; i < n; i  = blocksize)
       for (int j = 0; j < m; j  = blocksize)
           for (int k = i; k < i   blocksize;   k
               for (int l = j; l < j   blocksize;   l)
                  dst[k   l*n] = src[l   k*m];
}

Depending upon the number of threads, cores, size of the matrices among other factors it might be that running sequential would actually be faster than the parallel versions. This is specially true in your code that is not very CPU intensive (i.e., dst[k l*n] = src[l k*m];)