How can optimze below code. take example of
x_tsvd = [[1,2,3,4],[5,4,5,6],[2,4,5,5]]
svd_tfidf=[[11,3,4,5],[4,4,6,7],[6,6,3,5]]
The number of row are very large((4 million) above is just an example.Basically i have to calculate cosine similarity for each row of x_tsvd with svd_tfidf. Is there any way i can optimize it further to speed up.
for i in range(len(x_tsvd)):
array_=[]
for j in range (len(svd_tfidf)):
cosine_similarity_=np.dot(x_tsvd[i],svd_tfidf[j])/(norm(x_tsvd[i])*norm(svd_tfidf[j]))
array_.append(cosine_similarity_)
index=np.array(array_).argsort()
CodePudding user response:
Mainly through the following points to accelerate:
numpy.linalg.normfunction can calculate the norm along the axis. For 2D arrays, specifyaxisas 1 to calculate the Euclidean norm for each row.- By broadcasting, the elements between two norm vectors can be multiplied in pairs.
numpy.ndarray.dotmethod can be used between 2D arrays to calculate inner product in pairs.numpy.ndarray.argsortmethod can sort along the axis.
>>> x_tsvd = [[1, 2, 3, 4], [5, 4, 5, 6], [2, 4, 5, 5]]
>>> svd_tfidf = [[11, 3, 4, 5], [4, 4, 6, 7], [6, 6, 3, 5]]
>>> x = np.array(x_tsvd)
>>> y = np.array(svd_tfidf)
>>> norm_prod = np.linalg.norm(x, axis=1)[:, None] * np.linalg.norm(y, axis=1)
>>> similarities = x.dot(y.T) / norm_prod
>>> similarities.argsort(axis=1)
array([[0, 2, 1],
[0, 2, 1],
[0, 2, 1]], dtype=int64)
CodePudding user response:
Rather than creating 4Mx4M similarity matrix we can create a neighborhood similarity matrix, where:
- K is the k-nearest neighbor similarity of each observations in x_tsvd to svd_tfid
- Resulting similarity is 4MxK
Code
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import BallTree
def nearest_neighbors(x, y, k = 3):
'''
For each element of x, finds its k-nearest neighbors in y
Technique:
Neighbors are based upon cosine similarity
Converts cosine similarity to a Eucliean distance
based upon:
https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-learn-kneighborsclassifier/34145444#34145444
Returns:
- Distances to k nearest neighbors
- Indices to to k nearest neighbors
'''
# Normalize each sample of input
x = normalize(x_tsvd, axis = 1)
y = normalize(svd_tfidf, axis = 1)
# Form the BallTree
tree = BallTree(x, metric='euclidean')
# Get distances and indices to k nearest neighbors
distances, indices = tree.query(y, k = k)
# map distnaces to similarity
# transform distances back to similarity i.e. similariy = (2 - d**2)/2
return (2 - distances*distances)/2, indices
Test
x_tsvd = [[1,2,3,4],[5,4,5,6],[2,4,5,5]]
svd_tfidf=[[11,3,4,5],[4,4,6,7],[6,6,3,5]]
similarity, indices = nearest_neighbors(x_tsvd, svd_tfidf, k = 2)
print(f'Similarity Matrix\n {similarity}\n')
print(f'Index of Matches of x_tsvd in svd_tfidf\n {indices}')
Output
Similarity Matrix
[[0.88590616 0.72207119]
[0.98862307 0.98344042]
[0.95209915 0.88229057]]
Index of Matches of x_tsvd in svd_tfidf
[[1 2]
[1 2]
[1 2]]
Explanation
We use Balltree to find K nearest neighbor. However:
- Similarity is not a distance metric
- We use the technique from Using cosine distance with scikit learn KNeighborsClassifier to transform similarity to distances
- Create Balltree using Euclidean distance on normalized data
- Euclidean distance of each x, y vector using normalized data will be sqrt(2 − 2*x^T y)
- With x, y normalized, x^T y is the similarity. Thus, we have transformed similarity to a distance metric
- Query Balltree to obtain distances and indices of K nearest neighbors x_tsvd observations in svd_tfidf
- Convert distances back to similarity using inverse transformation
