I need to parallelize the inner loop of a nested loop with OpenMP. The way I did it is not working well. Each thread should iterate over all M points, but (in the second loop) only over its own chunk of coordinates. So I want the first loop to go from 0 to M and the second one from my_first_coord to my_last_coord. With the code I posted, the program is faster when launched with 4 threads than with 8, so there is some issue. I know one way to do this is to divide the coordinates "manually", meaning that each thread gets its own num_of_coords / thread_count share (taking the remainder into account); I did that with Pthreads. I would prefer to use OpenMP pragmas. I'm sure I'm missing something. Here is the code:
#pragma omp parallel
...
for (int i = 0; i < M; i++) { // all threads iterate from 0 to M
# pragma omp for nowait
for (int coord = 0; coord < N; coord++) { // each thread works on its portion of coords
centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
}
}
Here is the Pthreads version as well, so that it is clear what I want to achieve, only with pragmas instead of manual partitioning:
/*M is global,
first_nn and last_nn are local*/
for (long i = 0; i < M; i++)
for (long coord = first_nn; coord <= last_nn; coord++)
centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
I hope that it is clear enough. Thank you
Edit:
I'm using gcc 12.2.0. By adding the -O3 flag times have improved.
With larger inputs, the difference in speedup between 4 and 8 threads is more significant.
CodePudding user response:
Your comment indicates that you are worried about speedup.
- How many physical cores does your processor have? Try every thread count from 1 to that number.
- Do not use hyperthreads
- You may find a good speedup for low thread counts, but a leveling off effect: that is because you have a "streaming" operation, which is limited by bandwidth. Unless you have a very expensive processor, there is not enough bandwidth to keep all cores running fast.
- You could try setting OMP_PROC_BIND=true, which prevents the OS from migrating your threads. That can improve cache usage.
- You have some sort of indirect addressing going on with the i variable, so further memory effects related to the TLB may make your parallel code not scale optimally.
But start with point 3 and report.
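For trying the thread counts suggested above, here is a minimal sketch of the loop from the question wrapped in a function that takes the thread count as a parameter. The struct layout (Point, Accumulator, Centroid), the function name accumulate and the nthreads parameter are assumptions made for illustration, since the question does not show the real definitions, and the update is assumed to be an accumulation (+=). It relies on schedule(static): with the same iteration count in every pass, each thread should receive the same chunk of coordinates for every i, which is what makes nowait safe here.

```c
#include <stdlib.h>

/* Hypothetical, simplified layout: the real structs are not shown
   in the question. */
typedef struct { int cluster; double *coordinates; } Point;
typedef struct { double *coordinates; } Accumulator;
typedef struct { Accumulator accumulator; } Centroid;

/* Add every point into the accumulator of its cluster.
   All threads walk all m points; the inner loop over the n
   coordinates is split among the threads. */
void accumulate(Centroid *centroids, const Point *points,
                long m, long n, int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    for (long i = 0; i < m; i++) {
        /* Declared inside the parallel region, so private per thread. */
        double *acc = centroids[points[i].cluster].accumulator.coordinates;
        const double *src = points[i].coordinates;

        /* schedule(static) assigns the same coord chunk to the same
           thread for every i, so no two threads ever write the same
           acc[coord] and the barrier can be skipped with nowait. */
        #pragma omp for schedule(static) nowait
        for (long coord = 0; coord < n; coord++)
            acc[coord] += src[coord];
    }
}
```

Compiled with gcc -O3 -fopenmp, this can be called with nthreads going from 1 up to the number of physical cores, timing each call with omp_get_wtime() to see where the speedup levels off. If N is small relative to the thread count, each thread gets only a few coordinates per point and the worksharing overhead is likely to dominate.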
CodePudding user response:
I solved my problem thanks to the comments.
