I have a time series of vectors: Y = [v1, v2, ..., vn]. At each time t, I want to compute the distance between vector t and the average of the vectors before t. So for example, at t=3 I want to compute the cosine distance between v3 and (v1 v2)/2.
I have a script to do it but wondering if there's any way to do this faster via numpy's convolve feature or something like that?
import numpy
from scipy.spatial.distance import cosine
np.random.seed(10)
# Generate `T` vectors of dimension `vector_dim`
# NOTE: In practice, the vector is a very large column vector!
T = 3
vector_dim = 2
y = [np.random.rand(1, vector_dim)[0] for t in range(T)]
def moving_distance(v):
moving_dists = []
for t in range(len(v)):
if t == 0:
pass
else:
# Create moving average of values up until time t
prior_vals = v[:t]
m_avg = np.add.reduce(prior_vals) / len(prior_vals)
# Now compute distance between this moving average and vector t
moving_dists.append(cosine(m_avg, v[t]))
return moving_dists
d = moving_distance(y)
For this dataset, it should return: [0.3337342770170698, 0.0029993196890111262]
CodePudding user response:
These results indicate that Numba approach is the fastest by far and large, but the vectorized approach is reasonably fast.
When it comes to explicit non-accelerated looping, it is still beneficial to use the custom-defined cos_dist() in place of scipy.spatial.distance.cosine() (see moving_dist_cumsum() vs moving_dist_cumsum2()), while np.cumsum() is reasonably faster than np.add.reduce() but only marginally faster over computing the partial sum. Finally, moving_dist_OP() and moving_dist_simpler() are effectively equivalent (as expected).

