Problems in parallelizing radix sort in C with CUDA-CodePudding

I am trying to implement a radix sort algorithm for exercise using CUDA in C to be able to parallelize it; the code is as follows:

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <cuda_profiler_api.h>

int* populateArray(int * arr, int n){
    int j=0;
    for (int i=n; i>0 ; i--){
        arr[j]= i;
        j=j 1;
    }
    return arr;
    }

void printArray(int * array, int size){
    printf("Ordered array:\n");
    for (int j=0; j<size; j  ){
        printf("%d ", array[j]);
    }
    printf("\n");
}

__device__  int getMax(int* array, int n) {
  int max = array[0];
  for (int i = 1; i < n; i  )
    if (array[i] > max)
      max = array[i];
  return max;
}

__device__ void countingSort(int* array,int size,int digit, int index, int* output) {
  int count[10]={0};

  for (int i = 0; i < size; i  )
    count[(array[i] / digit) % 10]  ;

  for (int i = 1; i < 10; i  )
    count[i]  = count[i - 1];

  for (int i = size - 1; i >= 0; i--) {
    output[count[(array[i] /digit) % 10] - 1] = array[i];
    count[(array[i] / digit) % 10]--;
    }

  for (int i = 0; i < size; i  )
    array[i]= output[i];
  }


__global__ void radixsort(int* array, int size, int* output) {
  // define index
  int index = blockIdx.x * blockDim.x   threadIdx.x;
  
  // check that the thread is not out of the vector boundary
  if (index >= size) return;

  int max = getMax(array, size);
  for (int digit = 1; max / digit > 0; digit *= 10){
    countingSort(array, size, digit, index, output);
    }
 
}

int main(int argc, char *argv[]) {
    // Init array
    int n = 1000;
    int* array_h = (int*) malloc(sizeof(int) * n);
    populateArray(array_h, n);
    
    // allocate memory on device
    int* array_dev;
    cudaMalloc((void**)&array_dev, n*sizeof(int));
    int* output;
    cudaMalloc((void**)&output, n*sizeof(int));

    // copy data from host to device
    cudaMemcpy(array_dev, array_h, n*sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(32);
    dim3 grid((n-1)/block.x   1);
    printf("Number of threads for each block: %d\n", block.x);
    printf("Number of blocks in the grid: %d\n", grid.x);

    // Create start and stop CUDA events 
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    cudaError_t mycudaerror;
    mycudaerror = cudaGetLastError();
    // define the execution configuration
    radixsort<<<grid,block>>>(array_dev, n, output);

    // device synchronization and cudaGetLastError call
    mycudaerror = cudaGetLastError();
    if(mycudaerror != cudaSuccess)  {
      fprintf(stderr,"%s\n",cudaGetErrorString(mycudaerror)) ;
      //printf("Error in kernel!");
      exit(1);
    }

    // event record, synchronization, elapsed time and destruction
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float elapsed;
    cudaEventElapsedTime(&elapsed, start, stop);
    elapsed = elapsed/1000.f; // convert to seconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("Number of elements in the array: %d\n", n);
    printf("Kernel elapsed time: %.5f s\n", elapsed);

    // copy back results from device
    cudaMemcpy(array_h, array_dev, n*sizeof(int), cudaMemcpyDeviceToHost);

    // print ordered array
    printArray(array_h, n);
    
    // free resources on device
    cudaFree(array_dev);
    cudaFree(output);

    // free resources on host
    free(array_h);

    cudaProfilerStop();
    return 0; 
}

In order to run it, I'm using Google Colab. The maximum number of threads for each block is fixed at 32 (grid variable), while the number of blocks used is calculated in the main based on how many elements need to be sorted (block variable).

The problem arises when I start to change the number of elements inside the array to be sorted (the variable "n" present in main) as once a certain threshold is exceeded, the sorting is no longer performed correctly.

In order to get more information about this wrong execution, I also used the commands
cuda-memcheck and nvcc -lineinfo and I found that the error occurs due to this line of code in the kernel:

output[count[(array[i] /digit) % 10] - 1] = array[i];

Making several attempts, the error mainly seems to be in the computation of the index of the "output" array I am trying to write to; however, this error does not occur when, for example, I try to sort 32 or 64 elements. I would therefore like to know if I am doing something wrong in the code or if, simply, the radix sort is not parallelizable in the way I am trying. I know that having each thread do the sorting without using the index relative to each thread is very heavy computationally speaking, but first I wanted to try to solve this problem and then try to optimize the code in general.

Various proven approaches include:

use of atomic instructions;
declare the "output" array inside the kernel and don't pass it as a pointer from main;
use of only one block of threads for the calculation of the sorted array (an approach that strangely seems to work, but which is useless given the dynamic allocation of the number of blocks)

Thanks for any replies.

CodePudding user response：

You seem to already understand that this is not the way to approach this problem:

I know that having each thread do the sorting without using the index relative to each thread is very heavy computationally speaking, but first I wanted to try to solve this problem and then try to optimize the code in general.

Nearly any code that is serial in nature can be dropped in a CUDA kernel, run in a single thread, and it should produce the same result. However this is not the way to write CUDA codes; the performance will be dismal.

In many cases, algorithms that are serial in nature are not readily adaptable to parallelization. Your sorting approach is one of them. A typical method to (efficiently) parallelize radix sort is here.

Nevertheless, dropping a serial algorithm into a single thread "should" work. The problem arises here when you run the same serial code (unnecessarily) in many threads.

If all threads were perfectly in lockstep, they would all be doing exactly the same thing, in the same instruction or clock cycle. However GPUs don't work that way. About the largest "lockstep" you can witness is at the warp level (32 threads in a threadblock). As soon as you go to multiple warps, whether they are in the same threadblock or different threadblocks, then you will have threads at different points in your serial code, effectively stepping on each other.

A simple proof-point for this is to change your grid configuration to a single thread:

dim3 block(1);
dim3 grid(1);

Then the errors disappear, even for n = 1000, and the array is properly sorted. Because of the typical lockstep nature of a single warp, this should also work:

dim3 block(32);
dim3 grid(1);

Certain larger configurations might work in some cases, but with a large enough number of blocks, eventually all threads will not start at the same time, and this leads to trouble.

CodePudding user response：

An alternate solution is to use a single thread with a most significant digit radix sort to separate the array into multiple bins, such as 256 bins if using base 256 After this initial step, each of the bins can be sorted independently and in parallel using conventional least significant digit first radix sort. For this to work well, the data needs to be reasonably uniform (at least for the most significant digit) so that the multiple bins are somewhat similar in size.

Trying to do the initial step in parallel is complicated.