Bloated code generated for __builtin_popcnt when -mavx2 is on-CodePudding

For such a function, clang (and sometimes gcc in certain contexts that I cannot reproduce minimally) seems to generate bloated code when -mavx2 switch is on.

unsigned count(uint64_t *f) {
    unsigned c = 0;
    for (unsigned i = 0; i < 1024;   i) {
        if (sizeof(long) >= 8) {
            c  = __builtin_popcountl(f[i]);
        } else {
            c  = __builtin_popcountll(f[i]);
        }
    }
    return c;
}

This is from gcc and it's quite straightforward.

count:
        lea     rcx, [rdi 8192]
        xor     eax, eax
.L2:
        xor     edx, edx
        add     rdi, 8
        popcnt  rdx, QWORD PTR [rdi-8]
        add     eax, edx
        cmp     rcx, rdi
        jne     .L2
        ret

However clang decides to generate this massive bloat when -mavx2 is on. -mpopcnt was also set.

.LCPI0_0:
        .zero   32,15
.LCPI0_1:
        .byte   0                               # 0x0
        .byte   1                               # 0x1
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   3                               # 0x3
        .byte   4                               # 0x4
        .byte   0                               # 0x0
        .byte   1                               # 0x1
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   1                               # 0x1
        .byte   2                               # 0x2
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   2                               # 0x2
        .byte   3                               # 0x3
        .byte   3                               # 0x3
        .byte   4                               # 0x4
count:                                  # @count
        vpxor   xmm0, xmm0, xmm0
        xor     eax, eax
        vmovdqa ymm1, ymmword ptr [rip   .LCPI0_0] # ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
        vmovdqa ymm2, ymmword ptr [rip   .LCPI0_1] # ymm2 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
        vpxor   xmm12, xmm12, xmm12
        vpxor   xmm4, xmm4, xmm4
        vpxor   xmm5, xmm5, xmm5
        vpxor   xmm6, xmm6, xmm6
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm7, ymmword ptr [rdi   8*rax]
        vmovdqu ymm8, ymmword ptr [rdi   8*rax   32]
        vmovdqu ymm9, ymmword ptr [rdi   8*rax   64]
        vmovdqu ymm10, ymmword ptr [rdi   8*rax   96]
        vpand   ymm11, ymm7, ymm1
        vpshufb ymm11, ymm2, ymm11
        vpsrlw  ymm7, ymm7, 4
        vpand   ymm7, ymm7, ymm1
        vpshufb ymm7, ymm2, ymm7
        vpaddb  ymm7, ymm11, ymm7
        vpsadbw ymm7, ymm12, ymm7
        vpand   ymm11, ymm8, ymm1
        vpshufb ymm11, ymm2, ymm11
        vpsrlw  ymm8, ymm8, 4
        vpand   ymm8, ymm8, ymm1
        vpshufb ymm8, ymm2, ymm8
        vpaddb  ymm8, ymm8, ymm11
        vpsadbw ymm8, ymm8, ymm12
        vpand   ymm11, ymm9, ymm1
        vpshufb ymm11, ymm2, ymm11
        vpsrlw  ymm9, ymm9, 4
        vpand   ymm9, ymm9, ymm1
        vpshufb ymm9, ymm2, ymm9
        vpaddb  ymm9, ymm9, ymm11
        vpsadbw ymm9, ymm9, ymm12
        vpand   ymm11, ymm10, ymm1
        vpshufb ymm11, ymm2, ymm11
        vpsrlw  ymm10, ymm10, 4
        vpand   ymm10, ymm10, ymm1
        vpshufb ymm10, ymm2, ymm10
        vpaddb  ymm10, ymm10, ymm11
        vpsadbw ymm10, ymm10, ymm12
        vextracti128    xmm3, ymm7, 1
        vpackusdw       xmm3, xmm7, xmm3
        vpaddd  xmm0, xmm0, xmm3
        vextracti128    xmm3, ymm8, 1
        vpackusdw       xmm3, xmm8, xmm3
        vpaddd  xmm4, xmm4, xmm3
        vextracti128    xmm3, ymm9, 1
        vpackusdw       xmm3, xmm9, xmm3
        vpaddd  xmm5, xmm5, xmm3
        vextracti128    xmm3, ymm10, 1
        vpackusdw       xmm3, xmm10, xmm3
        vpaddd  xmm6, xmm6, xmm3
        add     rax, 16
        cmp     rax, 1024
        jne     .LBB0_1
        vpaddd  xmm0, xmm4, xmm0
        vpaddd  xmm0, xmm5, xmm0
        vpaddd  xmm0, xmm6, xmm0
        vpshufd xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        vzeroupper
        ret

clang's code is similar to gcc when only -mpopcnt is on, with a bit of unrolling.

count:                                  # @count
        xor     ecx, ecx
        xor     eax, eax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        popcnt  rdx, qword ptr [rdi   8*rcx]
        add     edx, eax
        popcnt  rsi, qword ptr [rdi   8*rcx   8]
        add     esi, edx
        popcnt  rdx, qword ptr [rdi   8*rcx   16]
        popcnt  rax, qword ptr [rdi   8*rcx   24]
        add     edx, esi
        add     eax, edx
        add     rcx, 4
        cmp     rcx, 1024
        jne     .LBB0_1
        ret

According to this document (https://www.agner.org/optimize/instruction_tables.pdf), popcnt is a very cheap instruction on most architectures. Then why is clang generating such a bloat to replace popcnt when I clearly allowed to use it with -mpopcnt? The optimization level was all set to -O3.

Here is a link to godbolt (https://godbolt.org/z/4vWK33a7c).

CodePudding user response：

It's auto-vectorizing as well as unrolling, which is a performance win for large arrays (or would be if clang had less overhead), at least on Intel CPUs where popcnt is 1/clock, so 64 bits per clock. (AMD Zen has 3 or 4/clock popcnt, so with add instructions taking an equal amount of the 4 scalar-integer ALU ports, it could sustain 2/clock uint64_t popcnt load and add.) https://uops.info/

But vpshufb is also 1/clock on Intel (or 2/clock on Ice Lake), and if it's the bottleneck that's 128 bits of popcount work per clock. (Doing table lookups for the low 4 bits of each of 32 bytes.) But it's certainly not going to be that good, with all the extra shuffling it's doing inside the loop. :/

This vectorization loses on Zen1 where the SIMD ALUs are only 256 bits wide, but should be a significant win on Intel, and maybe a win on Zen2 and later.

But looks like clang widens to 32-bit counts inside the inner loop with vpsadbw, so it's not as good as it could be. 1024x uint64_t is 256 __m256i vectors of input data, and clang is unrolling by 4 so the max count in any one element is only 64, which can't overflow.

Clang is unrolling a surprising amount, given how much work it does. The vextracti128 and vpackusdw don't make much sense to me, IDK why it would do that inside the loop. The simple way to vectorize without overflow risk is just vpsadbw -> vpaddq or vpaddd, and it's already using vpsadbw for horizontal byte sums within 8-byte chunks. (A better way is to defer that until just before the byte elements could overflow, so do a few vpaddb. Like in How to count character occurrences using SIMD, although the byte counters are only incremented by 0 or 1 there, rather than 0 .. 8)

See Counting 1 bits (population count) on large data using AVX-512 or AVX-2, especially Wojciech Muła's big-array popcnt functions: https://github.com/WojciechMula/sse-popcount/ - clang is using the same strategy as popcnt_AVX2_lookup but with a much less efficient way to accumulate the results across iterations.