Using godbolt.org with x86-64 gcc 11.2, this code...
typedef int v4i __attribute__ ((vector_size (16)));
typedef union {
v4i v;
} int4;
int4 mul(int4 l, int4 r)
{
return (int4){.v=l.v * r.v};
}
...produces this assembly (when compiled with -O3 -mavx)...
mul:
vpmulld xmm0, xmm0, xmm1
ret
However this code...
typedef int v4i __attribute__ ((vector_size (16)));
typedef union {
v4i v;
struct {int x,y,z,w;}; // this line is the change
int i[4]; // this one too
} int4;
int4 mul(int4 l, int4 r)
{
return (int4){.v=l.v * r.v};
}
...produces this assembly (when also compiled with -O3 -mavx)...
mul:
mov QWORD PTR [rsp-40], rdi
mov QWORD PTR [rsp-32], rsi
vmovdqa xmm1, XMMWORD PTR [rsp-40]
mov QWORD PTR [rsp-24], rdx
mov QWORD PTR [rsp-16], rcx
vpmulld xmm0, xmm1, XMMWORD PTR [rsp-24]
vmovdqa XMMWORD PTR [rsp-40], xmm0
mov rax, QWORD PTR [rsp-40]
mov rdx, QWORD PTR [rsp-32]
ret
x86-64 clang 13.0.1 gives similar results.
So my question is: how can I convince gcc (and/or clang) that these two blocks of code should produce the same output?
I've tried __attribute__ ((aligned)), removing the int i[4]; or the struct, applying __attribute__ ((packed)) to the struct, I even gave __attribute__ ((transparent_union)) a go. Whatever magic status __attribute__ ((vector_size (16))) bestows is broken by adding anything to the union.
CodePudding user response:
I should say that I have never worked with this attribute personally and only checked the GCC documentation just now, but I saw something there that I think will be useful for your problem.
From your code, I assume that you want the union so you can access each int of the vector separately. If that is the only reason, it is not necessary to add int[4] or struct {int x,y,z,w;}; to the union, because vectors can be subscripted like arrays themselves:
#include <stdio.h> // needed for printf below
typedef int v4i __attribute__ ((vector_size (16)));
typedef union {
v4i v;
} int4;
int4 mul(int4 l, int4 r)
{
int4 ret = (int4){.v=l.v * r.v};
printf("%i %i %i %i", ret.v[0], ret.v[1], ret.v[2], ret.v[3]);
return ret;
}
and the code will be optimized as you like. In addition, if you need byte-level access, a union with another vector type works too:
typedef int v4i __attribute__ ((vector_size (16)));
typedef unsigned char v4b __attribute__ ((vector_size (16)));
struct i4s{int x,y,z,w;};
typedef union {
v4i v;
v4b v2;
} int4;
int4 mul(int4 l, int4 r)
{
return (int4){.v=l.v * r.v};
}
It seems that the union keeps its vector treatment as long as its members are primitive-like types. For example, even __m128i works.
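If the struct was only there for named access, one possible workaround is to keep the union vector-only and put the names in small helpers instead. The sketch below assumes GCC or clang vector extensions; `LANE_X` to `LANE_W`, `int4_get` and `int4_set` are hypothetical names made up for this illustration, not part of any library.

```c
#include <assert.h>

/* GCC/clang vector extension: four 32-bit ints in one 16-byte vector. */
typedef int v4i __attribute__ ((vector_size (16)));

/* The union keeps only a vector member, as suggested above. */
typedef union {
    v4i v;
} int4;

/* Hypothetical lane names standing in for struct {int x,y,z,w;}. */
enum { LANE_X = 0, LANE_Y = 1, LANE_Z = 2, LANE_W = 3 };

/* Read one lane; the vector is subscripted like an array. */
static inline int int4_get(int4 a, int lane)
{
    assert(lane >= 0 && lane < 4);
    return a.v[lane];
}

/* Return a copy of a with one lane replaced. */
static inline int4 int4_set(int4 a, int lane, int value)
{
    assert(lane >= 0 && lane < 4);
    a.v[lane] = value;
    return a;
}

/* Element-wise multiply, unchanged from the question. */
int4 mul(int4 l, int4 r)
{
    return (int4){.v = l.v * r.v};
}
```

With this, `int4_get(p, LANE_X)` plays the role of `p.x`, and the signature of `mul` stays the same as in the first listing of the question.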
CodePudding user response:
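Turns out, they are the same. The second version additionally populates the xmm registers from the stack. Judging by the assembly, its arguments arrive in rdi, rsi, rdx and rcx and its result leaves in rax and rdx, so this looks like a calling-convention effect rather than a missed optimization: as far as I can tell, the System V x86-64 ABI stops treating the union as a pure vector once plain int members are added, and passes it in general-purpose registers instead. The multiplication itself is still a single vpmulld. If, for example, one adds a main function...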
int main(int argc, char *argv[])
{
// volatile keyword added so they don't get optimised out
volatile int4 x = {.v={1,2,3,4}};
volatile int4 y = {.v={1,2,3,4}};
int4 z = mul(x, y);
return z.v[0];
}
...then the function (a single vpmulld instruction in this case) gets inlined, and different, appropriate stack manipulation is generated around it.
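To check the "they are the same" claim at the level of values rather than assembly, here is a small sketch that runs both union layouts side by side. The names `int4_plain`, `int4_named`, `mul_plain`, `mul_named` and `same_result` are made up for this comparison; the layouts themselves are the two from the question. It assumes GCC or clang (vector extensions, anonymous struct members).

```c
#include <assert.h>
#include <string.h>

typedef int v4i __attribute__ ((vector_size (16)));

/* Layout from the first listing: vector only. */
typedef union {
    v4i v;
} int4_plain;

/* Layout from the second listing: vector plus named and indexed views. */
typedef union {
    v4i v;
    struct { int x, y, z, w; };
    int i[4];
} int4_named;

int4_plain mul_plain(int4_plain l, int4_plain r)
{
    return (int4_plain){.v = l.v * r.v};
}

int4_named mul_named(int4_named l, int4_named r)
{
    return (int4_named){.v = l.v * r.v};
}

/* Returns 1 when both versions give byte-identical results for the same inputs. */
int same_result(v4i a, v4i b)
{
    int4_plain p = mul_plain((int4_plain){.v = a}, (int4_plain){.v = b});
    int4_named n = mul_named((int4_named){.v = a}, (int4_named){.v = b});
    assert(sizeof p == sizeof n);
    return memcmp(&p, &n, sizeof p) == 0;
}
```

Calling `same_result` with any two vectors should return 1; the only difference between the two functions is how the arguments and the return value travel across the call boundary, which disappears once the call is inlined.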
