For such a function, clang (and sometimes gcc in certain contexts that I cannot reproduce minimally) seems to generate bloated code when -mavx2 switch is on.
unsigned count(uint64_t *f) {
unsigned c = 0;
for (unsigned i = 0; i < 1024; i) {
if (sizeof(long) >= 8) {
c = __builtin_popcountl(f[i]);
} else {
c = __builtin_popcountll(f[i]);
}
}
return c;
}
This is from gcc and it's quite straightforward.
count:
lea rcx, [rdi 8192]
xor eax, eax
.L2:
xor edx, edx
add rdi, 8
popcnt rdx, QWORD PTR [rdi-8]
add eax, edx
cmp rcx, rdi
jne .L2
ret
However clang decides to generate this massive bloat when -mavx2 is on. -mpopcnt was also set.
.LCPI0_0:
.zero 32,15
.LCPI0_1:
.byte 0 # 0x0
.byte 1 # 0x1
.byte 1 # 0x1
.byte 2 # 0x2
.byte 1 # 0x1
.byte 2 # 0x2
.byte 2 # 0x2
.byte 3 # 0x3
.byte 1 # 0x1
.byte 2 # 0x2
.byte 2 # 0x2
.byte 3 # 0x3
.byte 2 # 0x2
.byte 3 # 0x3
.byte 3 # 0x3
.byte 4 # 0x4
.byte 0 # 0x0
.byte 1 # 0x1
.byte 1 # 0x1
.byte 2 # 0x2
.byte 1 # 0x1
.byte 2 # 0x2
.byte 2 # 0x2
.byte 3 # 0x3
.byte 1 # 0x1
.byte 2 # 0x2
.byte 2 # 0x2
.byte 3 # 0x3
.byte 2 # 0x2
.byte 3 # 0x3
.byte 3 # 0x3
.byte 4 # 0x4
count: # @count
vpxor xmm0, xmm0, xmm0
xor eax, eax
vmovdqa ymm1, ymmword ptr [rip .LCPI0_0] # ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
vmovdqa ymm2, ymmword ptr [rip .LCPI0_1] # ymm2 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
vpxor xmm12, xmm12, xmm12
vpxor xmm4, xmm4, xmm4
vpxor xmm5, xmm5, xmm5
vpxor xmm6, xmm6, xmm6
.LBB0_1: # =>This Inner Loop Header: Depth=1
vmovdqu ymm7, ymmword ptr [rdi 8*rax]
vmovdqu ymm8, ymmword ptr [rdi 8*rax 32]
vmovdqu ymm9, ymmword ptr [rdi 8*rax 64]
vmovdqu ymm10, ymmword ptr [rdi 8*rax 96]
vpand ymm11, ymm7, ymm1
vpshufb ymm11, ymm2, ymm11
vpsrlw ymm7, ymm7, 4
vpand ymm7, ymm7, ymm1
vpshufb ymm7, ymm2, ymm7
vpaddb ymm7, ymm11, ymm7
vpsadbw ymm7, ymm12, ymm7
vpand ymm11, ymm8, ymm1
vpshufb ymm11, ymm2, ymm11
vpsrlw ymm8, ymm8, 4
vpand ymm8, ymm8, ymm1
vpshufb ymm8, ymm2, ymm8
vpaddb ymm8, ymm8, ymm11
vpsadbw ymm8, ymm8, ymm12
vpand ymm11, ymm9, ymm1
vpshufb ymm11, ymm2, ymm11
vpsrlw ymm9, ymm9, 4
vpand ymm9, ymm9, ymm1
vpshufb ymm9, ymm2, ymm9
vpaddb ymm9, ymm9, ymm11
vpsadbw ymm9, ymm9, ymm12
vpand ymm11, ymm10, ymm1
vpshufb ymm11, ymm2, ymm11
vpsrlw ymm10, ymm10, 4
vpand ymm10, ymm10, ymm1
vpshufb ymm10, ymm2, ymm10
vpaddb ymm10, ymm10, ymm11
vpsadbw ymm10, ymm10, ymm12
vextracti128 xmm3, ymm7, 1
vpackusdw xmm3, xmm7, xmm3
vpaddd xmm0, xmm0, xmm3
vextracti128 xmm3, ymm8, 1
vpackusdw xmm3, xmm8, xmm3
vpaddd xmm4, xmm4, xmm3
vextracti128 xmm3, ymm9, 1
vpackusdw xmm3, xmm9, xmm3
vpaddd xmm5, xmm5, xmm3
vextracti128 xmm3, ymm10, 1
vpackusdw xmm3, xmm10, xmm3
vpaddd xmm6, xmm6, xmm3
add rax, 16
cmp rax, 1024
jne .LBB0_1
vpaddd xmm0, xmm4, xmm0
vpaddd xmm0, xmm5, xmm0
vpaddd xmm0, xmm6, xmm0
vpshufd xmm1, xmm0, 238 # xmm1 = xmm0[2,3,2,3]
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
vzeroupper
ret
clang's code is similar to gcc when only -mpopcnt is on, with a bit of unrolling.
count: # @count
xor ecx, ecx
xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
popcnt rdx, qword ptr [rdi 8*rcx]
add edx, eax
popcnt rsi, qword ptr [rdi 8*rcx 8]
add esi, edx
popcnt rdx, qword ptr [rdi 8*rcx 16]
popcnt rax, qword ptr [rdi 8*rcx 24]
add edx, esi
add eax, edx
add rcx, 4
cmp rcx, 1024
jne .LBB0_1
ret
According to this document (https://www.agner.org/optimize/instruction_tables.pdf), popcnt is a very cheap instruction on most architectures. Then why is clang generating such a bloat to replace popcnt when I clearly allowed to use it with -mpopcnt? The optimization level was all set to -O3.
Here is a link to godbolt (https://godbolt.org/z/4vWK33a7c).
CodePudding user response:
It's auto-vectorizing as well as unrolling, which is a performance win for large arrays (or would be if clang had less overhead), at least on Intel CPUs where popcnt is 1/clock, so 64 bits per clock. (AMD Zen has 3 or 4/clock popcnt, so with add instructions taking an equal amount of the 4 scalar-integer ALU ports, it could sustain 2/clock uint64_t popcnt load and add.) https://uops.info/
But vpshufb is also 1/clock on Intel (or 2/clock on Ice Lake), and if it's the bottleneck that's 128 bits of popcount work per clock. (Doing table lookups for the low 4 bits of each of 32 bytes.) But it's certainly not going to be that good, with all the extra shuffling it's doing inside the loop. :/
This vectorization loses on Zen1 where the SIMD ALUs are only 256 bits wide, but should be a significant win on Intel, and maybe a win on Zen2 and later.
But looks like clang widens to 32-bit counts inside the inner loop with vpsadbw, so it's not as good as it could be. 1024x uint64_t is 256 __m256i vectors of input data, and clang is unrolling by 4 so the max count in any one element is only 64, which can't overflow.
Clang is unrolling a surprising amount, given how much work it does. The vextracti128 and vpackusdw don't make much sense to me, IDK why it would do that inside the loop. The simple way to vectorize without overflow risk is just vpsadbw -> vpaddq or vpaddd, and it's already using vpsadbw for horizontal byte sums within 8-byte chunks. (A better way is to defer that until just before the byte elements could overflow, so do a few vpaddb. Like in How to count character occurrences using SIMD, although the byte counters are only incremented by 0 or 1 there, rather than 0 .. 8)
See Counting 1 bits (population count) on large data using AVX-512 or AVX-2, especially Wojciech Muła's big-array popcnt functions: https://github.com/WojciechMula/sse-popcount/ - clang is using the same strategy as popcnt_AVX2_lookup but with a much less efficient way to accumulate the results across iterations.
