Is MPI_Igather thread-safe?-CodePudding

I am trying to start a sequence of MPI_Igather calls (non-blocking collectives from MPI 4), do some work, and then whenever an Igather finishes, do some more work on that data.

That works fine, unless I start the Igather's from different threads on each MPI rank. In that case I often get a deadlock, even though I call MPI_Init_thread to make sure that MPI_THREAD_MULTIPLE is provided. Non-blocking collectives do not have a tag to match sends and receives, but I thought this is handled by the MPI_Request object associated with each collective operation?

The most simple example I found failing is this:

given np MPI ranks, each of which has a local array of length np
start one MPI_Igather for each element i, gathering those elements on process i.
the i-loop is parallelized using OpenMP
then call MPI_Waitall() to finish all communication. This is where the program hangs when setting OMP_NUM_THREADS to a value larger than 1.

I made two variants of this program: igather_threaded.cpp (the code below), which behaves as described above, and igather_threaded_v2.cpp, which gathers everything on MPI rank 0. This version does not deadlock, but the data is not ordered correctly either.

igather_threaded.cpp:

#include <iostream>
#include <memory>
#include <mpi.h>

#define PRINT_ARRAY(STR,PTR) \
  for (int r=0; r<nproc; r  ){\
    if (rank==r){\
      for (int j=0; j<nproc; j  ){\
        (STR) << (PTR)[j] << " ";\
      }\
      (STR) << std::endl << std::flush;\
    }\
    MPI_Barrier(MPI_COMM_WORLD);\
  }

int main(int argc, char* argv[])
{
int provided;
MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE, &provided);
if (provided!=MPI_THREAD_MULTIPLE){
  std::cerr << "Your MPI installation does not provide threading level MPI_THREAD_MULTIPLE, required for this program." << std::endl;
  MPI_Abort(MPI_COMM_WORLD, -1);
}
int rank, nproc;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&nproc);

  std::unique_ptr<double[]> M1(new double[nproc]);
  std::unique_ptr<double[]> M2(new double[nproc]);

  for (int j=0; j<nproc; j  ) M1[j] = rank*nproc j 1;

  PRINT_ARRAY(std::cout,M1);

std::unique_ptr<MPI_Request[]> requests(new MPI_Request[nproc]);

#pragma omp parallel for schedule(static) shared(requests)
for (int j=0; j<nproc; j  ){
  MPI_Igather(M1.get() j,  1,  MPI_DOUBLE,
              M2.get(), 1, MPI_DOUBLE, j,
              MPI_COMM_WORLD, &requests[j]);
}

  MPI_Waitall(nproc,requests.get(),MPI_STATUSES_IGNORE);

  if (rank==0) std::cout << " => "<<std::endl;
  PRINT_ARRAY(std::cout,M2);
  MPI_Finalize();
  return 0;
}

My question is: is my code theoretically correct and this is a bug in OpenMPI (4.0.3 used here), or did I miss anything that disallows starting MPI_Igather calls by multiple threads? I also tried it with Intel MPI with similar result.

To compile, I use OpenMPI 4.0.3 (configured to support MPI_THREAD_MULTIPLE), gcc 11.2.0, and the command line

> mpicxx -std=c  17 -fopenmp -o igather_threaded igather_threaded.cpp

To run, I use

mpirun -np 4 --bind-to none env OMP_NUM_THREADS=4 env OMP_PROC_BIND=false ./igather_threaded

You may have to increase the number of MPI ranks and/or OpenMP threads, and/or run it several times because the exact behavior is not deterministic.

CodePudding user response：

All your gathers go into the same M2array. If you do independent gathers, you need to have independent gather buffers.

Comment: it looks like you're trying to parallelize an Allgather. Since the send buffer is only a single number, the increased latency and thread overhead probably makes this less efficient than using Allgather.

(previous answer deleted)

CodePudding user response：

The principle of all non-blocking MPI calls is that you are not allowed to use the message buffers while the communication is in flight. Therefore, here, you cannot access M2 between a call to MPI_Igather() and the corresponding call to MPI_Wait*().

Here, not only you do it, but you do it nproc times. That means that you code is already very wrong even without parallelizing with OpenMP. The fact that the error only becomes visible with multithreaded version is just by chance. The code was wrong before that.