Calculating Euclidean distance with a lot of pairs of points is too slow in Python-CodePudding

The main goal is to generate the customer similarity based on Euclidean distance, and find the 5 most similar customers for each customer.

I have 400,000 customers data, each of them has 40 attributes. The DataFame looks like:

          A1 A2 ... A40
0         xx xx ... xx
1         xx xx ... xx
2         xx xx ... xx
...       ...
399,999   xx xx ... xx

I first standardize these data by sklearn's StandardScaler. Now we get the processed data X_data.

So now we have 400,000 customers(points/vectors), each has 40 dimensions. So far so good.

I then use dis = numpy.linalg.norm(a-b) to calculate the distance of each pair of two points. The shorter the distance is, the more similar the customers are.

What I planed was to calculate the 5 most similar customers for each customer, and then combined the results together. I firstly start from customer0 to have a try. But it is already too slow for just one customer. Even I decrease the 40 dimensions to 2 dimensions by PCA from sklearn.decomposition, it is still too slow.

result=pd.DataFrame(columns=['index1','index2','distance'])
for i in range(len(X_data)):
    dis = numpy.linalg.norm(X_data[0]-X_data[i])
    result.loc[len(result)]=[0,i,dis]
result=result.sort_values(by=['distance])
result=result[1:6] #pick the first 5 customers starting from the second customer, because the first one is himself with 0 distance value

The result look like this, it shows the 5 most similar customers of customer0:

  index1 index2 distance
0   0    206391  0.004
1   0    314234  0.006
2   0    89284   0.007
3   0    124826  0.012
4   0    234513  0.013

So to get the result for all the 400,000 customers, i can just put another for loop outside this for loop. But the problem is, in this case, it is already so slow even I just calculate the most 5 similar customers for only customer0, not to mention all the customers. What should I do to get it faster? Any ideas?

CodePudding user response：

Use efficient implementation of scikit:

sklearn.metrics.pairwise_distances(X)

which will returns; a distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X.

Then you can use np.argpartition(D, k) to access the top k indices. Or simply based on the scikit docs and @bb1 comment:

import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
neigh.kneighbors(samples, 2, return_distance=False)[:,1]

CodePudding user response：

Using numpy vectorized operations you can avoid both the for loops. I will take a shorter example, which you can easily extrapolate. Suppose I have 3 data points(400,000 in your case) each 4 dimensional (40 dimensional in your case).

a = np.array([2,4,5,6])
b = np.array([3,5,6,7])
c = np.array([4,6,7,8])
d = np.vstack([a,b,c])
d.shape
(3,4)

now calculate the outer difference of 3 vectors a,b,c in d

  [[a,     b,     c]]   

[a,  a-a   a-b    a-c

 b,  b-a   b-b    b-c

 c,  c-a   c-b    c-a
]

imagine all of the vectors extending into the 3rd dimension (perpendicular to the screen). Euclidean distance between 2 vectors x and y is norm(x-y). So what we want is norm of this matrix along axis = 2

This matrix can be generated by broadcasting d with reshaped version of d with shape (3,1,4)

v = d - d.reshape((3,1,4))
v

array([[[ 0,  0,  0,  0],
     [ 1,  1,  1,  1],
     [ 2,  2,  2,  2]],

   [[-1, -1, -1, -1],
    [ 0,  0,  0,  0],
    [ 1,  1,  1,  1]],

   [[-2, -2, -2, -2],
    [-1, -1, -1, -1],
    [ 0,  0,  0,  0]]])

Notice the rows of 0s in 3 matrices. Now we want to find the norm of this matrix along axis = 2

np.linalg.norm(v,axis=2)

array([[0., 2., 4.],
   [2., 0., 2.],
   [4., 2., 0.]])

Now all we have to do is find n largest numbers along axis = 1. There are many methods to do that, for which please refer to this question, depending on whether you want just the 5 nearest values as well as the indices.