OpenMP parallelize grouped array sum using pointers-CodePudding

I want to effectively parallelize the following sum in C:

  #pragma omp parallel for num_threads(nth)
  for(int i = 0; i < l;   i) pout[pg[i]]  = px[i];

where px is a pointer to a double array x of size l containing some data, pg is a pointer to an integer array g of size l that assigns each data point in x to one of ng groups which occur in a random order, and pout is a pointer to a double array out of size ng which is initialized with zeros and contains the result of summing x over the grouping defined by g.

The code above works, but the performance is not optimal so I wonder if there is somewthing I can do in OpenMP (such as a reduction() clause) to improve the execution. The dimensions l and ng of the arrays, and the number of threads nth are available to me and fixed beforehand. I cannot directly access the arrays, only the pointers are passed to a function which does the parallel sum.

CodePudding user response：

From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control with critical sections or locks will probably only slow down the code.

Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.

Here is the crucial bit of code:

#pragma omp parallel shared(sum,threadsum)
  {
    int thread = omp_get_thread_num(),
      myfirst = thread*ngroups;
    #pragma omp for
    for ( int i=0; i<biglen; i   )
      threadsum[ myfirst indexes[i] ]  = 1;
    #pragma omp for
    for ( int igrp=0; igrp<ngroups; igrp   )
      for ( int t=0; t<nthreads; t   )
        sum[igrp]  = threadsum[ t*ngroups igrp ];
  }

Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even though I've eliminated things like false sharing, I get pathetic or no speedup. This is not clear to me yet.

CodePudding user response：

Your code has a data race (at line pout[pg[i]] = ...), you should fix it first, then worry about its performance.
if ng is not too big and you use OpenMP 4.5 , the most efficient solution is using reduction: #pragma omp parallel for num_threads(nth) reduction( :pout[:ng])
if ng is too big, most probably the best idea is to use a serial version of the program on PCs. Note that your code will be correct by adding #pragma omp atomic before pout[pg[i]] = .., but its performance is questionable.