Can OpenMP's SIMD directive vectorize indexing operations?-CodePudding

Say I have an MxN matrix (SIG) and a list of Nx1 fractional indices (idxt). Each fractional index in idxt uniquely corresponds to the same position column in SIG. I would like to index to the appropriate value in SIG using the indices stored in idxt, take that value and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:

void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
               const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
               double& bfVal) {
    Eigen::VectorXd bfPTVec(idxt.size());
    #pragma omp simd
    for (int j = 0; j < idxt.size(); j  ) {
        int vIDX = static_cast<int>(idxt(j));
        double interp1 = vIDX   1 - idxt(j);
        double interp2 = idxt(j) - vIDX;
        bfPTVec(j) = (SIG(vIDX,j)*interp1   SIG(vIDX 1,j)*interp2);
    }
    bfVal = ((idxt.array() > 0.0).select(bfPTVec,0.0)).sum();
}

I suspect there is a better way to implement the body of the loop here that would help the compiler better exploit SIMD operations. For example, as I understand it, forcing the compiler to cast between types, both explicitly as the first line does and implicitly as some of the mathematical operations do is not a vectorizable operation.

Additionally, by making the access to SIG dependent on values in idxt which are calculated at runtime I'm not clear if the type of memory read-write I'm performing here is vectorizable, or how it could be vectorized. Looking at the big picture description of my problem where each idxt corresponds to the same "position" column as SIG, I get a sense that it should be a vectorizable operation, but I'm not sure how to translate that into good code.

CodePudding user response：

Theoretically, it should be possible, assuming the processor support this operation. However, in practice, this is not the case for many reasons.

First of all, mainstream x86-64 processors supporting the instruction set AVX-2 (or AVX-512) does have instructions for that: gather SIMD instructions. Unfortunately, the instruction set is quite limited: you can only fetch 32-bit/64-bit values from the memory base on 32-bit/64-bit indices. Moreover, this instruction is not very efficiently implemented on mainstream processors yet. Indeed, it fetch every item separately which is not faster than a scalar code, but this can still be useful if the rest of the code is vectorized since reading many scalar value to fill a SIMD register manually tends to be a bit less efficient (although it was surprisingly faster on old processors due to a quite inefficient early implementation of gather instructions). Note that is the SIG matrix is big, then cache misses will significantly slow down the code.

Additionally, AVX-2 is not enabled by default on mainstream processors because not all x86-64 processors supports it. Thus, you need to enable AVX-2 (eg. using -mavx2) so compilers could vectorize the loop efficiently. Unfortunately, this is not enough. Indeed, most compilers currently fail to automatically detect when this instruction can/should be used. Even if they could, then the fact that IEEE-754 floating point number operations are not associative and values can be infinity or NaN generally does not help them to generate an efficient code (although it should be fine here). Note that you can tell to your compiler that operations can be assumed associated and you use only finite/basic real numbers (eg. using -ffast-math, which can be unsafe). The same thing apply for Eigen type/operators if compilers fail to completely inline all the functions (which is the case for ICC).

To speed up the code, you can try to change the type of the SIG variable to a matrix reference containing int32_t items. Another possible optimization is to split the loop in small fixed-size chunks (eg.32 items) and split the loop in many parts so to compute the indirection in a separate loops so compilers can vectorize at least some of the loops. Some compilers likes Clang are able to do that automatically for you: they generate a fast SIMD implementation for a part of the loop and do the indirections use scalar instructions. If this is not enough (which appear to be the case so far), then you certainly need to vectorize the loop yourself using SIMD intrinsics (or possible use SIMD libraries that does that for you).

CodePudding user response：

Probably no, but I would expect manually vectorized version to be faster.

Below is an example of that inner loop, untested. It doesn’t use AVX only SSE up to 4.1, and should be compatible with these Eigen matrices you have there.

The pIndex input pointer should point to the j-th element of your idxt vector, and pSignsColumn should point to the start of the j-th column of the SIG matrix. It assumes your SIG matrix is column major. It’s normally the default memory layout in Eigen but possible to override with template arguments, and probably with macros as well.

inline double computePoint( const double* pIndex, const int16_t* pSignsColumn )
{
    // Load the index value into both lanes of the vector
    __m128d idx = _mm_loaddup_pd( pIndex );

    // Convert into int32 with truncation; this assumes the number there ain't negative.
    const int iFloor = _mm_cvttsd_si32( idx );

    // Compute fractional part
    idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
    // Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
    idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );

    // Load two int16_t values from sequential addresses
    const __m128i signsInt = _mm_loadu_si32( pSignsColumn   iFloor );
    // Upcast them to int32, then to fp64
    const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );

    // Compute the result
    __m128d res = _mm_mul_pd( idx, signs );
    res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
    // The above 2 lines (3 instructions) can be replaced with the following one:
    // const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
    // It may or may not be better, the dppd instruction is not particularly fast.

    return _mm_cvtsd_f64( res );
}