Multiplying and adding float numbers

Hello, I have a task to convert some C code to asm, and I wonder whether my plan makes sense. First I would convert the integers to floats. I would like to load the array data into an SSE register, but here is the problem: I want only 3 integers, not 4. Is there a way around that? Then I would convert those integers to floats using CVTDQ2PS and save the results in memory. For the constants like 0.393 I would make 3 vectors of floats and then do the same operation three times, so I will think about sepiaRed only. For that I would load my converted values into an SSE register and multiply them by the constants, which would leave the three products in the xmm0 register. Now how can I add them together?

I guess my two questions are: how can I get 3 items from an array into an SSE register without running into problems, and how can I add the three numbers in the xmm0 register together?


        tmpGreen = (float)pixels[i + 1];
        tmpRed = (float)pixels[i + 2];
        tmpBlue = (float)pixels[i];

        sepiaRed = (int)(0.393 * tmpRed + 0.769 * tmpGreen + 0.189 * tmpBlue); //red
        sepiaGreen = (int)(0.349 * tmpRed + 0.686 * tmpGreen + 0.168 * tmpBlue); //green
        sepiaBlue = (int)(0.272 * tmpRed + 0.534 * tmpGreen + 0.131 * tmpBlue); //blue

CodePudding user response：

You can't easily horizontally add 3 numbers together; see the Stack Overflow question "Fastest way to do horizontal SSE vector sum (or other reduction)".

What you can efficiently do is map 4 pixels in parallel, with vectors of 4 reds, 4 greens, and 4 blues. (You'd want to load those from planar, not interleaved, pixel data: struct of arrays, not array of structs.)
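As a rough sketch of that 4-pixels-in-parallel idea, assuming the planar data is stored as three separate int arrays (the function name sepia_red_x4 and the array layout are made up for illustration, they are not from the question):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: sepiaRed for 4 pixels at once.
   Assumes planar input: each pointer has at least 4 readable ints.
   Uses float constants, so results may differ in the last bit from
   the double-precision C code in the question. */
void sepia_red_x4(const int *red, const int *green, const int *blue, int *out)
{
    /* movdqu + cvtdq2ps: load 4 ints and convert them to 4 floats */
    __m128 r = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)red));
    __m128 g = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)green));
    __m128 b = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)blue));

    /* Vertical multiplies and adds only, no horizontal sum needed */
    __m128 sum = _mm_mul_ps(r, _mm_set1_ps(0.393f));
    sum = _mm_add_ps(sum, _mm_mul_ps(g, _mm_set1_ps(0.769f)));
    sum = _mm_add_ps(sum, _mm_mul_ps(b, _mm_set1_ps(0.189f)));

    /* cvttps2dq truncates toward zero, like the (int) cast in C */
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(sum));
}
```

sepiaGreen and sepiaBlue would reuse the same r, g, b vectors with the other two rows of constants.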

You might be able to get some benefit for doing a single pixel at once, though, if you just load 4 ints with movdqu and use a multiplier of 0.0 for the high element after cvtdq2ps. Then you can do a normal horizontal sum of 4 elements instead of having to adjust it. (Hmm, although doing 3 would let you do the 2nd shuffle in parallel with the first add, instead of after.)
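A minimal sketch of that single-pixel variant, assuming pixels is an int array in blue, green, red order as in the question and that a fourth int after the pixel is readable (the helper names sepia_channel and sepia_pixel are hypothetical):

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE1 ones) */

/* Hypothetical helper: one sepia channel for the pixel starting at p.
   p[0] = blue, p[1] = green, p[2] = red; p[3] is loaded but multiplied by 0.0.
   Assumes p[3] is readable memory, since the load fetches 4 ints. */
int sepia_channel(const int *p, __m128 coef)
{
    __m128i vi = _mm_loadu_si128((const __m128i *)p);  /* movdqu */
    __m128 vf = _mm_cvtepi32_ps(vi);                   /* cvtdq2ps */
    __m128 prod = _mm_mul_ps(vf, coef);                /* lane 3 becomes 0 */

    /* Normal horizontal sum of 4 lanes */
    __m128 hi = _mm_movehl_ps(prod, prod);             /* lanes 2,3 moved to 0,1 */
    __m128 sum2 = _mm_add_ps(prod, hi);                /* lane0 = 0+2, lane1 = 1+3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum = _mm_add_ss(sum2, lane1);              /* lane0 = total */

    return _mm_cvttss_si32(sum);                       /* truncate like (int) */
}

void sepia_pixel(const int *p, int *r, int *g, int *b)
{
    /* _mm_set_ps lists lanes from high to low: (unused, red, green, blue) */
    *r = sepia_channel(p, _mm_set_ps(0.0f, 0.393f, 0.769f, 0.189f));
    *g = sepia_channel(p, _mm_set_ps(0.0f, 0.349f, 0.686f, 0.168f));
    *b = sepia_channel(p, _mm_set_ps(0.0f, 0.272f, 0.534f, 0.131f));
}
```

This uses float constants, so a result that lands very close to an integer boundary could truncate differently from the double-precision C code.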

Using SIMD inefficiently loses some of the benefit; see guides in https://stackoverflow.com/tags/sse/info especially https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ re: how people often try to use one SIMD vector to hold one x,y,z geometry vector, and then find that SIMD didn't help much.

CodePudding user response：

If speed is your concern, you should avoid going through the floating-point domain and work solely in fixed point (using 8/16-bit arithmetic).

Even though the (double precision) floating point factors are short in base 10, they are not that short in base 2:

0.393 = 3.93000000000000015987211554602E-1 == 0x3FD926E978D4FDF4
0.168 = 1.68000000000000010436096431476E-1 == 0x3FC5810624DD2F1B
etc.

Given that the original integer r,g,b data is limited to the range 0..255, the low-order bits of the factors do not contribute. Thus we may just truncate or round the binary representation as well.

If 7 bits of coefficient precision is enough, we could come up with the coefficient matrix of

50   98   24 == 0x32 0x62 0x18
45   88   22 == 0x2d 0x58 0x16
35   68   17 == 0x23 0x44 0x11
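Those integers are consistent with scaling each factor by 2^7 = 128 and rounding to nearest. A small sketch to reproduce them (quantize7 and quantize_sepia are names invented here):

```c
/* Hypothetical helper: round a non-negative coefficient to 7 fractional bits. */
int quantize7(double c)
{
    return (int)(c * 128.0 + 0.5);
}

/* Fill a 3x3 integer matrix from the floating-point sepia factors.
   Rows are the red, green, blue outputs; columns are the R, G, B inputs. */
void quantize_sepia(int q[3][3])
{
    static const double m[3][3] = {
        {0.393, 0.769, 0.189},
        {0.349, 0.686, 0.168},
        {0.272, 0.534, 0.131},
    };
    for (int row = 0; row < 3; row++)
        for (int col = 0; col < 3; col++)
            q[row][col] = quantize7(m[row][col]);
}
```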

7 bits, because the fastest way to compute small dot products in SSE is _mm_maddubs_epi16, which can multiply uint8_t RGB values with 8-bit signed (or 7-bit unsigned) coefficients.

Then we need to arrange the input and the coefficient matrices properly.

Option 1: Interleaved

R0G0B0R1G1B1R2G2B2R3G3B3R4G4B4R5G5B5R6G6B6...

Option 2: Planar

R0R1R2R3...     G0G1G2G3...       B0B1B2B3...

Either way, the goal is to reshuffle the data to

xmm0 = R0G0R1G1R2G2R3G3R4G4R5G5R6G6R7G7
xmm1 = B0xxB1xxB2xxB3xxB4xxB5xxB6xxB7xx

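rg0  = 326232623262...   (hex bytes 0x32 0x62 repeated; rg1, rg2 likewise from rows 2 and 3)
b0   = 180018001800...   (hex bytes 0x18 0x00 repeated; b1, b2 likewise)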

r_new_0 = _mm_maddubs_epi16(xmm0, rg0);
g_new_0 = _mm_maddubs_epi16(xmm0, rg1);
b_new_0 = _mm_maddubs_epi16(xmm0, rg2);

r_new_1 = _mm_maddubs_epi16(xmm1, b0);
g_new_1 = _mm_maddubs_epi16(xmm1, b1);
b_new_1 = _mm_maddubs_epi16(xmm1, b2);

r_new_0 = _mm_add_epi16(r_new_0, r_new_1);
g_new_0 = _mm_add_epi16(g_new_0, g_new_1);
b_new_0 = _mm_add_epi16(b_new_0, b_new_1);

Then we need to shift right by 7 and convert to uint8_t. This conversion needs saturation, since the coefficients in the first two rows sum to more than 128 (172 and 155), so the shifted result can exceed 255.

 r_new_0 = _mm_srli_epi16(r_new_0, 7);
 r_new_0 = _mm_packus_epi16(r_new_0, r_new_0);
 ... and same for g_new_0, b_new_0
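Putting the steps together for the planar case, here is a sketch for 8 pixels at a time. It assumes three separate uint8_t planes with at least 8 bytes each. The names sepia_row, sepia_planar8 and the SEPIA_SSSE3 macro are made up for this example. The target attribute is a GCC/Clang convenience; building with -mssse3 would work as well.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 */

/* GCC/Clang: enable SSSE3 just for these functions (or build with -mssse3). */
#if defined(__GNUC__)
#define SEPIA_SSSE3 __attribute__((target("ssse3")))
#else
#define SEPIA_SSSE3
#endif

/* Hypothetical helper: one output channel for 8 pixels.
   rg = R0G0R1G1..., bz = B0 0 B1 0 ...
   crg packs the R coefficient in the low byte and G in the high byte. */
SEPIA_SSSE3
static __m128i sepia_row(__m128i rg, __m128i bz, short crg, short cb)
{
    __m128i acc = _mm_maddubs_epi16(rg, _mm_set1_epi16(crg));   /* R*cr + G*cg */
    acc = _mm_add_epi16(acc, _mm_maddubs_epi16(bz, _mm_set1_epi16(cb)));
    acc = _mm_srli_epi16(acc, 7);                               /* drop 7 fraction bits */
    return _mm_packus_epi16(acc, acc);                          /* saturate to 0..255 */
}

SEPIA_SSSE3
void sepia_planar8(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                   uint8_t *out_r, uint8_t *out_g, uint8_t *out_b)
{
    __m128i vr = _mm_loadl_epi64((const __m128i *)r);           /* 8 reds */
    __m128i vg = _mm_loadl_epi64((const __m128i *)g);           /* 8 greens */
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);           /* 8 blues */

    __m128i rg = _mm_unpacklo_epi8(vr, vg);                     /* R0G0R1G1... */
    __m128i bz = _mm_unpacklo_epi8(vb, _mm_setzero_si128());    /* B0 0 B1 0 ... */

    /* Coefficient rows from the 7-bit matrix above */
    _mm_storel_epi64((__m128i *)out_r, sepia_row(rg, bz, 0x6232, 0x0018));
    _mm_storel_epi64((__m128i *)out_g, sepia_row(rg, bz, 0x582d, 0x0016));
    _mm_storel_epi64((__m128i *)out_b, sepia_row(rg, bz, 0x4423, 0x0011));
}
```

For very bright pixels the R+G pair can saturate inside _mm_maddubs_epi16 at 32767, but since 32767 >> 7 is already 255 the packed result still comes out as 255.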

This final step shows a very small inefficiency, since one half of the register capacity is lost; consuming 24 bytes of input we produced 8+8+8 outputs.

It's probably better anyway to start working with 16+16+16 input bytes, which leads to 12 multiplications, the first being complete well in time for the additions.