Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute v

I want to see if it's possible to write some generic SIMD code that can compile efficiently, mostly for SSE, AVX, and NEON. A simplified version of the problem is: find the maximum absolute value of an array of floating point numbers and return both the value and the index. It is the last part, the index of the maximum, that causes the problem: there doesn't seem to be a very good way to write generic vector code that has a branch.
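
For reference, the scalar version of the problem can be sketched as below. This is only a baseline to compare the vector versions against; the name maxabs_scalar and the choice to report the first index on ties are assumptions, not part of the original code.

```c
#include <stddef.h>

/* Scalar baseline: returns max |data[i]| and stores the index of its
   first occurrence. For an empty array it returns 0 with *index = 0. */
float maxabs_scalar(const float *data, size_t n, size_t *index) {
    float best = 0.0f;
    size_t best_i = 0;
    for (size_t i = 0; i < n; i++) {
        float a = data[i] < 0.0f ? -data[i] : data[i];
        if (a > best) {          /* strict '>' keeps the earliest index on ties */
            best = a;
            best_i = i;
        }
    }
    *index = best_i;
    return best;
}
```

The branch inside the loop is exactly the part that is awkward to express once data[i] becomes a whole vector.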

Here's a sample implementation (more complete version on godbolt):

#include <stdbool.h>  // bool, used by any()

#define VLEN 8
typedef float vNs __attribute__((vector_size(VLEN*sizeof(float))));
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
#define SWAP128 4,5,6,7, 0,1,2,3
#define SWAP64 2,3, 0,1,  6,7, 4,5
#define SWAP32 1, 0,  3, 2,  5, 4,  7, 6

static bool any(vNb x) {
    x = x | __builtin_shufflevector(x,x, SWAP128);
    x = x | __builtin_shufflevector(x,x, SWAP64);
    x = x | __builtin_shufflevector(x,x, SWAP32);
    return x[0];
}

float maxabs(float* __attribute__((aligned(32))) data, unsigned n, unsigned *index) {
    vNs max = {0,0,0,0,0,0,0,0};
    vNs tmax;
    unsigned imax = 0;
    for (unsigned i = 0 ; i < n; i += VLEN) {
        vNs t = *(vNs*)(data + i);
        t = -t < t ? t : -t;  // Absolute value
        vNb cmp = t > max;
        if (any(cmp)) {
            tmax = t; imax = i;
            // broadcast horizontal max of t into every element of max
            vNs tswap128 = __builtin_shufflevector(t,t, SWAP128);
            t = t < tswap128 ? tswap128 : t;
            vNs tswap64 = __builtin_shufflevector(t,t, SWAP64);
            t = t < tswap64 ? tswap64 : t;
            vNs tswap32 = __builtin_shufflevector(t,t, SWAP32);
            max = t < tswap32 ? tswap32 : t;
        }
    }
    // To simplify example, ignore finding index of true value in tmax==max
    *index = imax; //   which(tmax == max);
    return max[0];
}

Code on godbolt allows changing VLEN to 8 or 4.

This mostly works very well. For AVX/SSE the absolute value becomes t & 0x7fffffff using a (v)andps, i.e. it clears the sign bit. For NEON it's done with vneg + fmaxnm. The block to find and broadcast the horizontal max becomes an efficient sequence of permute and max instructions. gcc is able to use NEON fabs for absolute value.
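
The sign-bit clearing that the compiler arrives at can also be written explicitly with the same vector extensions. A minimal sketch, assuming 4-element vectors; the names v4f, v4i and vabs4 are made up for illustration:

```c
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* |x| per lane: reinterpret the float bits as integers and clear bit 31.
   A cast between same-sized vector types reinterprets the bits. */
static inline v4f vabs4(v4f x) {
    const v4i mask = {0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff};
    return (v4f)((v4i)x & mask);
}
```

Whether this compiles to a better instruction than the t = -t < t ? t : -t form depends on the compiler and target.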

The 8 element vector on the 4 element SSE/NEON targets works well on clang. It uses a pair of instructions on two sets of registers, and for the SWAP128 horizontal op it will max or OR the two registers without any unnecessary permute. gcc, on the other hand, really can't handle this and produces mostly non-SIMD code. If we reduce the vector length to 4, gcc works fine for SSE and NEON.

But there's a problem with if (any(cmp)). For clang SSE/AVX, it works well: vcmpltps + vptest, with an orps to go from 8->4 on SSE.

But gcc and clang on NEON do all the permutes and ORs, then move the result to a gp register to test.

Is there some bit of code, other than architecture specific intrinsics, to get ptest with gcc and vmaxvq with clang/gcc and NEON?

I tried some other methods, like if (x[0] || x[1] || ... x[7]) but they were worse.

CodePudding user response：

As commented by chtz, the most generic and typical method is to have another mask to gather indices:

Vec8s indices = { 0,1,2,3,4,5,6,7 };
Vec8s max_idx = indices;
Vec8f max_abs = abs(load8(ptr));

for (auto i = 8; i + 8 <= vec_length; i += 8) {
    indices = indices + 8;                       // indices of the block being loaded
    Vec8f data = abs(load8(ptr + i));
    auto mask = is_greater(data, max_abs);
    max_idx = bitselect(mask, indices, max_idx); // take the new index where data wins
    max_abs = max(max_abs, data);
}
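
The pseudocode above can be sketched with GNU vector extensions in plain C. This is a sketch under stated assumptions, not a tuned implementation: it uses 4-element vectors, requires n to be a non-zero multiple of 4, does the final reduction in scalar code, and the names maxabs_tracked, v4f and v4i are invented here. The select is written with and/or/not because gcc does not accept the vector ?: operator in C.

```c
#include <string.h>

typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* Assumes n > 0 and n % 4 == 0 (simplification for the sketch). */
float maxabs_tracked(const float *data, unsigned n, unsigned *index) {
    const v4i signclear = {0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff};
    const v4i step = {4, 4, 4, 4};
    v4i indices = {0, 1, 2, 3};
    v4i max_idx = indices;
    v4f max_abs;
    memcpy(&max_abs, data, sizeof max_abs);       /* alignment-safe load */
    max_abs = (v4f)((v4i)max_abs & signclear);

    for (unsigned i = 4; i + 4 <= n; i += 4) {
        v4f t;
        memcpy(&t, data + i, sizeof t);
        t = (v4f)((v4i)t & signclear);
        indices += step;                          /* indices of this block */
        v4i mask = t > max_abs;                   /* all-ones lanes where t wins */
        max_idx = (mask & indices) | (~mask & max_idx);
        max_abs = (v4f)((mask & (v4i)t) | (~mask & (v4i)max_abs));
    }

    /* Scalar reduction over the four lanes; smaller index wins ties. */
    float best = max_abs[0];
    unsigned best_i = (unsigned)max_idx[0];
    for (int k = 1; k < 4; k++) {
        unsigned idx_k = (unsigned)max_idx[k];
        if (max_abs[k] > best || (max_abs[k] == best && idx_k < best_i)) {
            best = max_abs[k];
            best_i = idx_k;
        }
    }
    *index = best_i;
    return best;
}
```

There is no branch on a vector condition in the loop, so the any() test from the question is not needed at all.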

Another option is to interleave the values and indices:

auto data = load8s(ptr) & 0x7fffffff; // can load data as int32_t
auto idx = vec8s{0,1,2,3,4,5,6,7};
auto lo = zip_lo(idx, data);
auto hi = zip_hi(idx, data);

for (int i = 8; i + 8 <= size; i += 8) {
    idx = idx + 8;
    auto d1 = load8s(ptr + i) & 0x7fffffff;
    auto lo1 = zip_lo(idx, d1);
    auto hi1 = zip_hi(idx, d1);
    lo = max_u64(lo, lo1);
    hi = max_u64(hi, hi1);
}
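
To see why a single 64-bit unsigned max carries both fields, here is a scalar sketch of the same packing. It assumes non-NaN inputs (for which the bit patterns of non-negative IEEE-754 floats order the same way as unsigned integers); pack_key and maxabs_packed are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* |x| as raw float bits in the high 32 bits, index in the low 32 bits.
   For non-NaN inputs, a larger |x| always gives a larger key. */
static uint64_t pack_key(float x, uint32_t index) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0x7fffffffu;                  /* clear the sign bit: absolute value */
    return ((uint64_t)bits << 32) | index;
}

float maxabs_packed(const float *data, uint32_t n, uint32_t *index) {
    uint64_t best = 0;
    for (uint32_t i = 0; i < n; i++) {
        uint64_t key = pack_key(data[i], i);
        if (key > best) best = key;       /* one unsigned max updates both fields */
    }
    uint32_t bits = (uint32_t)(best >> 32);
    float value;
    memcpy(&value, &bits, sizeof value);
    *index = (uint32_t)best;              /* low half is the index */
    return value;
}
```

Note that with this layout ties on the value are resolved towards the larger index; packing the bitwise complement of the index instead would favour the smaller one.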

This method is especially attractive if the range of inputs is small enough to shift the input left and append a few bits of the index in the low bits of the same word.

Even in this case we can repurpose 1 bit in the float (the sign bit, which is not needed once we take the absolute value), allowing us to save one half of the bit/index selection operations.

auto data0 = load8u(ptr) << 1; // take abs by shifting left 
auto data1 = (load8u(ptr + 8) << 1) + 1;  // encode odd index to data
auto mx = max_u32(data0, data1);  // the LSB contains one bit of index
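
A scalar sketch of the one-bit trick, assuming non-NaN inputs; maxabs_tagged and float_bits are names made up for the example:

```c
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float x) {
    uint32_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}

/* Returns max(|a|, |b|) and sets *from_b to 1 if b supplied it.
   The left shift discards the sign bit and frees bit 0 for the tag. */
float maxabs_tagged(float a, float b, unsigned *from_b) {
    uint32_t ka = float_bits(a) << 1;           /* tag 0 */
    uint32_t kb = (float_bits(b) << 1) | 1u;    /* tag 1 */
    uint32_t mx = ka > kb ? ka : kb;            /* plain unsigned max */
    *from_b = mx & 1u;
    uint32_t bits = mx >> 1;                    /* sign bit is now 0 */
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;
}
```

When the two magnitudes are equal, the tagged operand (b) wins, so this variant does not return the first occurrence.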

It looks like one can use double as the storage, since even SSE2 supports _mm_max_pd. Some attention needs to be given to Inf/NaN handling: a float Inf or NaN no longer encodes as Inf/NaN once its bits are reinterpreted as the high part of a 64-bit double.
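
A scalar sketch of the double-as-storage idea, restricted to finite inputs. The helper names key_as_double and index_of_larger are assumptions. One more caveat beyond Inf/NaN: a zero (or very small subnormal) float leaves the double's exponent field at zero, so the packed value is a subnormal double and would lose its index under flush-to-zero or denormals-are-zero modes.

```c
#include <stdint.h>
#include <string.h>

/* View the pair (|x| bits in the high word, index in the low word) as a double. */
static double key_as_double(float x, uint32_t index) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t key = ((uint64_t)(bits & 0x7fffffffu) << 32) | index;
    double d;
    memcpy(&d, &key, sizeof d);
    return d;
}

/* Index of the larger |x| of two candidates, using only a double comparison
   (the scalar counterpart of one _mm_max_pd lane). */
uint32_t index_of_larger(float a, uint32_t ia, float b, uint32_t ib) {
    double da = key_as_double(a, ia);
    double db = key_as_double(b, ib);
    double mx = da > db ? da : db;
    uint64_t key;
    memcpy(&key, &mx, sizeof key);
    return (uint32_t)key;                 /* low word still holds the index */
}
```

This works because non-negative doubles compare in the same order as their bit patterns read as unsigned integers.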

CodePudding user response：

I don’t believe that’s possible. Compilers aren’t smart enough to do that efficiently.

Compare the other answer (which uses NEON-like pseudocode) with the SSE version below:

#include <immintrin.h> // SSE intrinsics (_mm_blendv_ps needs SSE4.1)
#include <cstddef>     // size_t

// Compare vector absolute value with aa, if greater update both aa and maxIdx
inline void updateMax( __m128 vec, __m128i idx, __m128& aa, __m128& maxIdx )
{
    vec = _mm_andnot_ps( _mm_set1_ps( -0.0f ), vec );
    const __m128 greater = _mm_cmpgt_ps( vec, aa );
    aa = _mm_max_ps( vec, aa );
    // If you don't have SSE4, emulate with bitwise ops: and, andnot, or
    maxIdx = _mm_blendv_ps( maxIdx, _mm_castsi128_ps( idx ), greater );
}

float maxabs_sse4( const float* rsi, size_t length, size_t& index )
{
    // Initialize things
    const float* const end = rsi + length;
    const float* const endAligned = rsi + ( ( length / 4 ) * 4 );

    __m128 aa = _mm_set1_ps( -1 );
    __m128 maxIdx = _mm_setzero_ps();
    __m128i idx = _mm_setr_epi32( 0, 1, 2, 3 );

    // Main vectorized portion
    while( rsi < endAligned )
    {
        __m128 vec = _mm_loadu_ps( rsi );
        rsi += 4;
        updateMax( vec, idx, aa, maxIdx );
        idx = _mm_add_epi32( idx, _mm_set1_epi32( 4 ) );
    }

    // Handle the remainder, if present
    if( rsi < end )
    {
        __m128 vec;
        if( length > 4 )
        {
            // The source has at least 5 elements
            // Offset the source pointer + index back, by a few elements
            const int offset = (int)( 4 - ( length % 4 ) );
            rsi -= offset;
            idx = _mm_sub_epi32( idx, _mm_set1_epi32( offset ) );
            vec = _mm_loadu_ps( rsi );
        }
        else
        {
            // The source was smaller than 4 elements, copy them into temporary buffer and load vector from there
            alignas( 16 ) float buff[ 4 ];
            _mm_store_ps( buff, _mm_setzero_ps() );
            for( size_t i = 0; i < length; i++ )
                buff[ i ] = rsi[ i ];
            vec = _mm_load_ps( buff );
        }

        updateMax( vec, idx, aa, maxIdx );
    }

    // Reduce to scalar
    __m128 tmpMax = _mm_movehl_ps( aa, aa );
    __m128 tmpMaxIdx = _mm_movehl_ps( maxIdx, maxIdx );
    __m128 greater = _mm_cmpgt_ps( tmpMax, aa );
    aa = _mm_max_ps( tmpMax, aa );
    maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );

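    // SSE3 has 100% market penetration in 2022
    // Bring lane 1 of the partially reduced vectors down to lane 0;
    // duplicating the old tmpMax here would skip the element held in lane 1
    tmpMax = _mm_movehdup_ps( aa );
    tmpMaxIdx = _mm_movehdup_ps( maxIdx );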
    greater = _mm_cmpgt_ss( tmpMax, aa );
    aa = _mm_max_ss( tmpMax, aa );
    maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );

    index = (size_t)_mm_cvtsi128_si32( _mm_castps_si128( maxIdx ) );
    return _mm_cvtss_f32( aa );
}

As you can see, pretty much everything is completely different. Not just the boilerplate about the remainder and the final reduction; the main loop is very different too.

SSE doesn’t have bitselect; blendvps is not quite that: it selects 32-bit lanes based on the high bit of the selector. Unlike NEON, SSE doesn’t have an instruction for absolute value, so it needs to be emulated with a bitwise andnot.
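
The difference between the two kinds of select can be shown in portable C with GNU vector extensions. This is an illustrative sketch with made-up names (bitselect4, blendv4), not how either instruction set implements it: the first function is the and/andnot/or emulation mentioned in the code comment above, the second mimics the blendvps rule that only the top bit of each lane counts.

```c
typedef int v4i __attribute__((vector_size(16)));

/* Bitwise select: each result bit comes from a where mask has a 1, else from b. */
static inline v4i bitselect4(v4i mask, v4i a, v4i b) {
    return (mask & a) | (~mask & b);
}

/* blendv-style select: only bit 31 of each mask lane matters. The arithmetic
   right shift by 31 smears that bit across the lane, then a bitwise select follows. */
static inline v4i blendv4(v4i if_clear, v4i if_set, v4i mask) {
    const v4i s = {31, 31, 31, 31};
    return bitselect4(mask >> s, if_set, if_clear);
}
```

For masks produced by a comparison, where every lane is all-ones or all-zeros, the two give the same result, which is why either can be used in updateMax.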

The final reduction is going to be completely different as well. NEON has very limited shuffles, but it has better horizontal operations, like vmaxvq_f32, which finds the horizontal maximum over the complete SIMD vector.