Optimization of two for loop-CodePudding

How can optimze below code. take example of

x_tsvd = [[1,2,3,4],[5,4,5,6],[2,4,5,5]]
svd_tfidf=[[11,3,4,5],[4,4,6,7],[6,6,3,5]]

The number of row are very large((4 million) above is just an example.Basically i have to calculate cosine similarity for each row of x_tsvd with svd_tfidf. Is there any way i can optimize it further to speed up.

for i in range(len(x_tsvd)):
    array_=[]
    for j in range (len(svd_tfidf)):
        cosine_similarity_=np.dot(x_tsvd[i],svd_tfidf[j])/(norm(x_tsvd[i])*norm(svd_tfidf[j]))
        array_.append(cosine_similarity_)
    index=np.array(array_).argsort()

CodePudding user response：

Mainly through the following points to accelerate:

numpy.linalg.norm function can calculate the norm along the axis. For 2D arrays, specify axis as 1 to calculate the Euclidean norm for each row.
By broadcasting, the elements between two norm vectors can be multiplied in pairs.
numpy.ndarray.dot method can be used between 2D arrays to calculate inner product in pairs.
numpy.ndarray.argsort method can sort along the axis.

>>> x_tsvd = [[1, 2, 3, 4], [5, 4, 5, 6], [2, 4, 5, 5]]
>>> svd_tfidf = [[11, 3, 4, 5], [4, 4, 6, 7], [6, 6, 3, 5]]
>>> x = np.array(x_tsvd)
>>> y = np.array(svd_tfidf)
>>> norm_prod = np.linalg.norm(x, axis=1)[:, None] * np.linalg.norm(y, axis=1)
>>> similarities = x.dot(y.T) / norm_prod
>>> similarities.argsort(axis=1)
array([[0, 2, 1],
       [0, 2, 1],
       [0, 2, 1]], dtype=int64)

CodePudding user response：

Rather than creating 4Mx4M similarity matrix we can create a neighborhood similarity matrix, where:

K is the k-nearest neighbor similarity of each observations in x_tsvd to svd_tfid
Resulting similarity is 4MxK

Code

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import BallTree

def nearest_neighbors(x, y, k = 3):
    '''
        For each element of x, finds its k-nearest neighbors in y
        
        Technique:
            Neighbors are based upon cosine similarity
            Converts cosine similarity to a Eucliean distance
            based upon:
                https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-learn-kneighborsclassifier/34145444#34145444
            
        
        Returns:
            - Distances to k nearest neighbors
            - Indices to to k nearest neighbors
    '''
    # Normalize each sample of input
    x = normalize(x_tsvd, axis = 1)  
    y = normalize(svd_tfidf, axis = 1) 
    
    # Form the BallTree
    tree = BallTree(x, metric='euclidean')

    # Get distances and indices to k nearest neighbors
    distances, indices = tree.query(y, k = k)
    
    # map distnaces to similarity
    # transform distances back to similarity i.e. similariy = (2 - d**2)/2
    return (2 - distances*distances)/2, indices

Test

x_tsvd = [[1,2,3,4],[5,4,5,6],[2,4,5,5]]
svd_tfidf=[[11,3,4,5],[4,4,6,7],[6,6,3,5]]
similarity, indices = nearest_neighbors(x_tsvd, svd_tfidf, k = 2)

print(f'Similarity Matrix\n {similarity}\n')
print(f'Index of Matches of x_tsvd in svd_tfidf\n {indices}')

Output

Similarity Matrix
 [[0.88590616 0.72207119]
 [0.98862307 0.98344042]
 [0.95209915 0.88229057]]

Index of Matches of x_tsvd in svd_tfidf
 [[1 2]
 [1 2]
 [1 2]]

Explanation

We use Balltree to find K nearest neighbor. However:

Similarity is not a distance metric
We use the technique from Using cosine distance with scikit learn KNeighborsClassifier to transform similarity to distances
Create Balltree using Euclidean distance on normalized data
Euclidean distance of each x, y vector using normalized data will be sqrt(2 − 2*x^T y)
With x, y normalized, x^T y is the similarity. Thus, we have transformed similarity to a distance metric
Query Balltree to obtain distances and indices of K nearest neighbors x_tsvd observations in svd_tfidf
Convert distances back to similarity using inverse transformation