The main goal is to generate the customer similarity based on Euclidean distance, and find the 5 most similar customers for each customer.
I have 400,000 customers data, each of them has 40 attributes. The DataFame looks like:
A1 A2 ... A40
0 xx xx ... xx
1 xx xx ... xx
2 xx xx ... xx
... ...
399,999 xx xx ... xx
I first standardize these data by sklearn's StandardScaler. Now we get the processed data X_data.
So now we have 400,000 customers(points/vectors), each has 40 dimensions. So far so good.
I then use dis = numpy.linalg.norm(a-b) to calculate the distance of each pair of two points. The shorter the distance is, the more similar the customers are.
What I planed was to calculate the 5 most similar customers for each customer, and then combined the results together. I firstly start from customer0 to have a try. But it is already too slow for just one customer. Even I decrease the 40 dimensions to 2 dimensions by PCA from sklearn.decomposition, it is still too slow.
result=pd.DataFrame(columns=['index1','index2','distance'])
for i in range(len(X_data)):
dis = numpy.linalg.norm(X_data[0]-X_data[i])
result.loc[len(result)]=[0,i,dis]
result=result.sort_values(by=['distance])
result=result[1:6] #pick the first 5 customers starting from the second customer, because the first one is himself with 0 distance value
The result look like this, it shows the 5 most similar customers of customer0:
index1 index2 distance
0 0 206391 0.004
1 0 314234 0.006
2 0 89284 0.007
3 0 124826 0.012
4 0 234513 0.013
So to get the result for all the 400,000 customers, i can just put another for loop outside this for loop. But the problem is, in this case, it is already so slow even I just calculate the most 5 similar customers for only customer0, not to mention all the customers. What should I do to get it faster? Any ideas?
CodePudding user response:
Use efficient implementation of scikit:
sklearn.metrics.pairwise_distances(X)
which will returns; a distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X.
Then you can use np.argpartition(D, k) to access the top k indices.
Or simply based on the scikit docs and @bb1 comment:
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
neigh.kneighbors(samples, 2, return_distance=False)[:,1]
CodePudding user response:
Using numpy vectorized operations you can avoid both the for loops. I will take a shorter example, which you can easily extrapolate. Suppose I have 3 data points(400,000 in your case) each 4 dimensional (40 dimensional in your case).
a = np.array([2,4,5,6])
b = np.array([3,5,6,7])
c = np.array([4,6,7,8])
d = np.vstack([a,b,c])
d.shape
(3,4)
now calculate the outer difference of 3 vectors a,b,c in d
[[a, b, c]]
[a, a-a a-b a-c
b, b-a b-b b-c
c, c-a c-b c-a
]
imagine all of the vectors extending into the 3rd dimension (perpendicular to the screen). Euclidean distance between 2 vectors x and y is norm(x-y). So what we want is norm of this matrix along axis = 2
This matrix can be generated by broadcasting d with reshaped version of d with shape (3,1,4)
v = d - d.reshape((3,1,4))
v
array([[[ 0, 0, 0, 0],
[ 1, 1, 1, 1],
[ 2, 2, 2, 2]],
[[-1, -1, -1, -1],
[ 0, 0, 0, 0],
[ 1, 1, 1, 1]],
[[-2, -2, -2, -2],
[-1, -1, -1, -1],
[ 0, 0, 0, 0]]])
Notice the rows of 0s in 3 matrices. Now we want to find the norm of this matrix along axis = 2
np.linalg.norm(v,axis=2)
array([[0., 2., 4.],
[2., 0., 2.],
[4., 2., 0.]])
Now all we have to do is find n largest numbers along axis = 1. There are many methods to do that, for which please refer to this question, depending on whether you want just the 5 nearest values as well as the indices.
