Efficient code to load AVX vectors for 1D convolution kernel of length 8-CodePudding

An implementation of a 1D convolution operation will often need to load a vectors of data that sequentially step through a buffer of data offset by one element each iteration.

For example, consider a buffer of input data X[0], X[1], ..., X[n-1], where n is greater than twice the kernel length. If the convolution length is three, and we can fit eight elements in each vector, we might first want a vector with X[0], X[1], ..., X[7], then the next with X[1], X[2], ..., X[8] and the last with X[2], X[3], ..., X[9].

Consider the case where the kernel length as well as the vector length is 8. We must load eight vectors, that might look sequentially like this:

{  0   1   2   3   4   5   6   7  }
{  1   2   3   4   5   6   7   8  }
{  2   3   4   5   6   7   8   9  }
{  3   4   5   6   7   8   9  10  }
{  4   5   6   7   8   9  10  11  }
{  5   6   7   8   9  10  11  12  }
{  6   7   8   9  10  11  12  13  }
{  7   8   9  10  11  12  13  14  }

By reducing this sequence vertically, we could produce a running mean or sum. I.e., the sum of these vectors will have the sum of the first 8 elements in it's first position.

Consider that the order of the elements in the column does not matter. Any permutation of the elements in each column will still produce the same result. For a convolution, this permutation can be accounted for by altering the order of the constants used in the kernel.

Is there a faster way to load these vectors that takes advantage of this? Consider as a baseline the simple sequence of unaligned loads:

// Any sort of sliding window function, i.e. running mean, running max, convolution, etc.
void sliding_window(const float* input, unsigned length)
{
    for (unsigned i = 0; i < length - 7; i  = 8) {
        for (unsigned j = 0; i < 8; j  ) {
            __m256 v = _mm256_loadu_ps(input[i   j]);
            // reduction operation on v (e.g. max or fmadd) goes here
        }
    }
    // handle tail here
}

CodePudding user response：

The best I've been able to come up with is this sequence:

{  0   1   2   3   4   5   6   7  }
{  1   2   3   8   5   6   7  12  }
{  2   3   8   9   6   7  12  13  }
{  3   8   9  10   7  12  13  14  }
{  4   5   6   7   8   9  10  11  }
{  5   6   7   4   9  10  11   8  }
{  6   7   4   5  10  11   8   9  }
{  7   4   5   6  11   8   9  10  }

Each column contains the necessary elements. The first column contains 0 - 7, the next 1 - 8, then 2 - 9, etc.

This can be produced with the following sequence of operations:

void sliding_window(const float* input, unsigned length)
{
    __m256 a = _mm256_load_ps(input);   
    for (unsigned i = 8; i < length - 7; i  = 8) {
        __m256 b = _mm256_load_ps(input   i);

        __m256i ai = _mm256_castps_si256(a);  // not part of sequence
        __m256i bi = _mm256_castps_si256(b);  // just for code reduction

        // a is the first vector, these are remaining 7
        __m256 j1 = _mm256_castsi256_ps(_mm256_alignr_epi8(bi, ai, 4));
        // Reduction operation (add, fmadd, max, etc.) between a and j1 goes here
        __m256 j2 = _mm256_castsi256_ps(_mm256_alignr_epi8(bi, ai, 8));
        // Reduction with j2 goes here, and so on after each value
        __m256 j3 = _mm256_castsi256_ps(_mm256_alignr_epi8(bi, ai, 12));
        __m256 r0 = _mm256_permute2f128_ps(a, b, 0x21);
        a = b;  // Register with "b" isn't needed anymore
        __m256 r1 = _mm256_permute_ps(r0, 0x39);
        __m256 r2 = _mm256_permute_ps(r0, 0x4e);
        __m256 r3 = _mm256_permute_ps(r0, 0x93);
        // Final reduction with r3 to produce result
    }
    // handle tail here
}

On Zen3, I benchmark this as about 10% faster than the sequence of unaligned loads.

CodePudding user response：

First of all, you should note that if your convolution is separable, this is very often worth doing. Simple example:

res[i] = x[i] x[i 1] x[i 2] x[i 3] x[i 4] x[i 5] x[i 6] x[i 7];

This can be done by convoluting with [1 1] * [1 0 1] * [1 0 0 0 1] in three steps, for example like so:

void sliding_window(float* output, const float* input, size_t length)
{
    // Nomenclature
    // aX input at i X
    // bX convolution with [1 1] starting at i X
    // cX convolution with [1 1] * [1 0 1] starting at i X
    // dX convolution with [1 1] * [1 0 1] * [1 0 0 0 1] starting at i X

    __m256 a0 = _mm256_load_ps(input), a8 = _mm256_load_ps(input   8);
    __m256 b0 = _mm256_add_ps(a0, _mm256_loadu_ps(input 1)), b8 = _mm256_add_ps(a8, _mm256_loadu_ps(input 9));
    __m256 b4 = _mm256_permute2f128_ps(b0, b8, 1 16*2);
    __m256 b2 = _mm256_shuffle_ps(b0, b4, 2 3*4 0*16 1*64);
    __m256 c0 = _mm256_add_ps(b0, b2);

    for (unsigned i = 0; i < length - 25; i  = 8) {
        // Convolute input with [1 1]
        __m256 a16 = _mm256_load_ps( input   i   16);
        __m256 a17 = _mm256_loadu_ps(input   i   17);
        __m256 b16 = _mm256_add_ps(a16, a17);

        // Convolute first convolution with [1 0 1]
        __m256 b12 = _mm256_permute2f128_ps(b8, b16, 1 16*2);
        __m256 b10 = _mm256_shuffle_ps(b8, b12, 2 3*4 0*16 1*64);
        __m256 c8 = _mm256_add_ps(b8, b10);

        // Convolute second convolution with [1 0 0 0 1]
        __m256 c4 = _mm256_permute2f128_ps(c0, c8, 1 16*2);
        __m256 d0 = _mm256_add_ps(c0, c4);

        // Store result
        _mm256_store_ps(output   i, d0);

        // rename registers for next iteration:
        b8 = b16;
        c0 = c8;
    }
    // handle tail here ...
}

You can of course replace addps by maxps. Godbolt-Demo: https://godbolt.org/z/W9K9o943o

Overall, this takes 1 aligned 1 unaligned load, 3 shuffles, 3 additions and 1 store for 8 elements (actually only using AVX1). On Intel CPUs with only 1 shuffle per cycle this may actually just be slightly faster than a naïve 8-load, 7-addition implementation (I did not benchmark this). On Zen3 I'm not sure about the actual cost of loading unaligned data.

If you have a non-trivial kernel it is probably hard to determine if it is separable, though.