AVX512: How to convert first 8 bytes into 8 64-bit integers?-CodePudding

I have an __m512i inputVector where each of the 64 bytes contains some offset. Next I need to add the first 8 byte offsets to 8 64-bit values stored in another __m512i variable (base). (In order to process all 64 bytes offsets I repeat the code below 4 times). However before I can do the vector addition of 8 packed 64-bit integers I need to convert the first 8 bytes into 8 64-bit integers. Currently my code uses two _mm512_cvtepu*_epi*() intrinsics to achieve this:

  // Convert first 16 bytes from inputVector into 16 32-bit values
  __m512i v16 = _mm512_cvtepu8_epi32(_mm512_extracti32x4_epi32(inputVector, 0));
  // Convert first 8 32-bit values into 8 64-bit values
  __m512i v8 = _mm512_cvtepu32_epi64(_mm512_extracti32x8_epi32(v16, 0));
  
  // Finally do the addition
  res = _mm512_add_epi64(base, v8);

Is there a better way to achieve this? My solution feels clumsy, I don't have much experience with AVX.

CodePudding user response：

_mm512_cvtepu8_epi64 does exactly what you want for the low 8 bytes. It only looks at the low 8 bytes of the input __m128i; you don't need to do anything special to make your 8 elements fill an __m128i. https://felixcloutier.com/x86/pmovzx

You can just _mm512_castsi512_si128 in case some compilers fail to optimize extract(v,0) to zero asm instructions.

For higher chunks, you can just load 8 byte chunks from memory in the first place so you can directly feed vpmovzxbq instructions, instead of loading 64 bytes and having to shuffle high qwords down to the bottom.

__m512i pmovzxbq(const char *p) {
               // yes loadl takes a __m128i* pointer but only actually loads the low 8 bytes of it.  Pretty poor API design in Intel's early SSE/SSE2 intrinsics
  __m128i bytes = _mm_loadl_epi64((const __m128i*)p);        // intrinsic for vmovq, but should optimize away into a memory source for pmovzx
  __m512i qwords = _mm512_cvtepu8_epi64(bytes);
  return qwords;
}

This compiles nicely with GCC9 and clang5.0 and later (https://godbolt.org/z/vdxxGssoz). Earlier versions fail to fold the load into a memory source operand, doing a separate vmovq. (Although the memory-source version doesn't micro-fuse the load anyway on current Intel with a YMM or wider destination.)

pmovzxbq:
        vpmovzxbq       zmm0, QWORD PTR [rdi]
        ret

Shuffling high parts of wider vectors

If you do want to do wider loads, or have the packed bytes in SIMD registers for some other reason (like the result of some computation), you have 2 options:

Shuffle 8-byte chunks to the bottom for vpmovzxbq
Manually do the shuffle that takes bytes from where they are (instead of the bottom), and puts them at the bottom of 8-byte chunks of the destination vector with other elements zeroed

The first can be done with valignq to right-shift / rotate a vector to bring the part you want to the bottom. (Immediate-control shuffles like _mm512_extracti32x4_epi32 can only work in 16-byte chunks; and general shuffles like vpermq only allow an immediate control operand up to YMM, beyond that there's only a version with a ZMM control vector.)

  // normally worse than just doing 8-byte loads unless v is in a register already
  __m512i chunk3 = _mm512_alignr_epi64(v,v, 3);      // rotate right by 3 qwords
  __m512i v3unpacked = _mm512_cvtepu8_epi64(_mm512_castsi512_si128(chunk3));

(GCC11 compiles as written; clang13 pessimizes to vextracti128 / vpshufd: https://godbolt.org/z/v1hfa7Ph3)

This avoids needing to load any constants or set up any mask registers, and valignq is only single uop with 3c latency on Intel CPUs, and supported since AVX-512F (i.e. Skylake-AVX512). https://uops.info/.

But that's an extra instruction inside a loop, like an extra load uop would be, but worse it competes for the shuffle unit on port 5 vs. vpmovzxbq, hurting throughput unless the surrounding code does plenty of work using other ports.

Inside a loop, with AVX-512VBMI `vpermb`

This requires Ice Lake or later for AVX512 VBMI (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512). If that's available, you can use vpermb with zero-masking to send each byte where it's needed, and zero all the rest. i.e. zero-extend the bytes from an 8-byte chunk of the source vector, all with one byte-shuffle that runs as a single uop on CPUs that support it.

 __m512i vin;
 
 const __m512i shuf1 = _mm512_setr_epi64(8,9,10,11,12,13,14,15);  // hopefully a compiler implements this with `vpmovzxbq` from memory, but probably will spend a full 64 bytes
 const __m512i shuf2 = _mm512_add_epi64(shuf1, _mm512_set1_epi64(8));  // probably compilers will load each of these separately instead of generating on the fly...
 const __m512i shuf3 = _mm512_add_epi64(shuf2, _mm512_set1_epi64(8));
 const __mmask64 zextmask = 0x0101010101010101;    // zero except for low byte of each qword.

// all the above outside a loop, or hopefully compilers can hoist them

 __m512i v0unpacked = _mm512_cvtepu8_epi64(vin);      // special case
 __m512i v1unpacked = _mm512_maskz_permutexvar_epi8(zextmask, shuf1, vin);
 __m512i v2unpacked = _mm512_maskz_permutexvar_epi8(zextmask, shuf2, vin);
  ...

(e.g. https://godbolt.org/z/czdjTeK41 - GCC loads 64-byte vector constants from memory, clang's shuffle optimizer turns it into memory-source vpmovzx instructions like vpmovzxbq zmm2, qword ptr [rsi rax 16]!)

So this is nice in a loop that justifies all that work setting up constants, saving 1 load or shuffle uop per output vector inside the loop.

Otherwise just load 8 bytes at a time instead of 64 bytes, like clang already does if you use this on the result of _mm512_loadu_si512.

CodePudding user response：

Based on your comments I was able to improve my code to:

  __m128i bytes16 = _mm512_castsi512_si128(inputVector);
  __m512i res = _mm512_cvtepu8_epi64(bytes16);
  res = _mm512_add_epi64(base, res);

For the high 8 bytes it seems slightly more difficult. Both methods that I tried used more instructions than the code for the low 8 bytes. As far as performance is concerned both methods ran equally fast in my test:

First solution:

  // Select bytes 8-15
  __m512i high8Bytes = _mm512_maskz_compress_epi8(0x000000000000ff00ull, inputVector);
  res = _mm512_cvtepu8_epi64(_mm512_castsi512_si128(high8Bytes));
  res = _mm512_add_epi64(base, res);

And the 2nd solution (my original code):

  __m512i words32 = _mm512_cvtepu8_epi32(_mm512_castsi512_si128(inputVector));
  __m512i res = _mm512_cvtepu32_epi64(_mm512_extracti32x8_epi32(words32, 1));
  res = _mm512_add_epi64(base, res);

Shuffling high parts of wider vectors

Inside a loop, with AVX-512VBMI vpermb

Inside a loop, with AVX-512VBMI `vpermb`