Why is this C code only marginally faster than Python?-CodePudding

I converted some python code into c hoping for some performance benefits but found that the c implementation was only marginally faster. The code I'm converting is an sos filter from the scipy library. My python test is;

a_wgt_coefs = [[0.2343006, 0.46860119, 0.2343006, 1., -0.22455845, 0.01260662],
               [1., -2., 1., 1., -1.89387049, 0.89515977, ],
               [1., -2., 1., 1., -1.99461446, 0.99462171]]

# Define input signal
fs = 48000
T = 1.0
N = int(fs * T)
t = np.linspace(0, T, N)

ip_signal = np.sin(2 * np.pi * 440 * t)

# Filter signal
num_runs = 1000


def main():
    durations = 0
    for n in range(num_runs):
        t_start = perf_counter()
        op_signal = signal.sosfilt(a_wgt_coefs, ip_signal)
        t_end = perf_counter()
        durations = durations   (t_end - t_start)
    avg_duration = durations / num_runs
    print(f'Average execution time = {avg_duration} seconds')


if __name__ == '__main__':
    main()

The C code is a line for line translation of the scipy _sosfilt function, which I've imlemented as so;

inline void sosfilt_cls_4(float sos[3][6], float x[SAMPLE_RATE]) {
    float x_n = 0, x_c = 0;
    float zi[3][2] = { 0 };
    // iterate over every i sample section
    for (size_t i = 0; i < SAMPLE_RATE;   i)
    {
        x_c = x[i];
        // iterate over every j section sample
        for (size_t j = 0; j < 3;   j)
        {
            float* section = sos[j];
            float* zi_n = zi[j];
            x_n = section[0] * x_c   zi_n[0];
            zi_n[0] = section[1] * x_c - section[4] * x_n   zi_n[1];
            zi_n[1] = section[2] * x_c - section[5] * x_n;
            x_c = x_n;
        }
        x[i] = x_c;
    }
    return;
}

I benchmared this using std::chrono;

float input_array[SAMPLE_RATE];
float sum = 0;

float sos_fs_48k_array_flt[3][6] = {{0.2343006f,   0.46860119f,  0.2343006f,   1.f, -0.22455845f,  0.01260662f},
 { 1.f, -2.f,          1.f,          1.f, -1.89387049f,  0.89515977f,},
 { 1.f, -2.f,          1.f,          1.f, -1.99461446f,  0.99462171f} };

int main()
{
    auto lin = linspace(0.0, 1.0, double(samples));

    std::cout << "Testing\n\n";
    for (int x = 0; x < runs; x  ) {
        for (int x = 0; x < samples; x  ) {
            input_col_vector(x, 0) = sin(2 * M_PI * 440 * lin.coeff(x, 0));
            input_array[x] = sin(2 * M_PI * 440 * lin.coeff(x, 0));
        }
        auto start = std::chrono::steady_clock::now();
        sosfilt_cls_4(sos_fs_48k_array_flt, input_array);
        auto stop = std::chrono::steady_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
        sum  = elapsed.count();

    }
    std::cout << "Average time:\t" << (sum / runs)/ 1e6 << std::endl;
    std::cout << "End.\n";
    return 0;
}

The python code runs in 0.0002448 seconds averaged over 1000 runs. The c code by comparison runs in 0.0002336 seconds averaged over 1000 runs.

I've set my compiler options in visual studio to prioritise speed over space, with the Ox flag set. I'm also using the fp:fast floating point model and also AVX2 enhanced instruction sets.

Other things I've done to improve speed is allocating all input data in the stack, using floats instead of doubles but they haven't really made a difference.

Oddly enough, over a single run, the c code is way faster, running in 0.0002333 compared to python which takes 0.00045

update: I am now getting 0.00012 seconds

I put the function in the main.cpp file and started in a fresh project with default optimizations.;

template<typename T>
inline void sosfilt(T sos[3][6], T x[SAMPLE_RATE]) {
    T x_n = 0, x_c = 0;
    T zi[3][2] = { 0 };
    // iterate over every i sample section
    for (size_t i = 0; i < SAMPLE_RATE;   i)
    {
        x_c = x[i];
        // iterate over every j section sample
        for (size_t j = 0; j < 3;   j)
        {
            T* section = sos[j];
            T* zi_n = zi[j];
            x_n = section[0] * x_c   zi_n[0];
            zi_n[0] = section[1] * x_c - section[4] * x_n   zi_n[1];
            zi_n[1] = section[2] * x_c - section[5] * x_n;
            x_c = x_n;
        }
        x[i] = x_c;
    }
    return;
}

CodePudding user response：

The function you are replicating is implemented in Cython, see https://github.com/scipy/scipy/blob/v1.7.1/scipy/signal/_sosfilt.pyx.

It is also specialized for the float case. So if it is written carefully (which I assume it is), it probably never has to call back into Python and is effectively pure C code, compiled as C, with a bit of different syntax, and effectively equivalent to your code.

CodePudding user response：

scipy/numpy basically call into optimized numerics libraries which are written in C/Fortran/C .

The only additional overhead is converting to/from python types, which is pretty quick with the cpython api.