Append pairwise distances to dataframe without missmatch-CodePudding

I have two dataframe df1 and df2. df1 consists of vectors of length 26 and df2 consists of both vectors of length 26 and unique ids. I provide dictionaries at the end of the question.

So, df1 is of the type:

vec1
1623       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
832   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

and df2 is given by:

                      id  \
774    A4060454751516272   
6570   A4060463723916275   
11466  A4060454576016272   
7394   A4050494816516277   

                                                                                                                                         vec2  
774      [2.0, 14.0, 9.0, 8.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 2.0, 14.0]  
6570     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
11466  [0.0, 9.0, 7.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 12.0, 15.0, 16.0]  
7394    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 0.0, 4.0, 6.0, 15.0, 15.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

What I want to do is to compute the distances (custom distance) between all vectors in df1 and all vectors in df2, so there are 8 distances. To do so, I have this code:

import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances

def custom_distance(x,y):
    d = np.count_nonzero(x)
    dist = np.sqrt(np.sum((x - y)**2/d))
    return dist

Ax =  df1['vec1'].apply(lambda x: np.array(x))
Bx =  df2['vec2'].apply(lambda x: np.array(x))

A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.stack(A)
BB = np.stack(B)

result = pairwise_distances(AA, BB, metric=custom_distance, n_jobs=-1)

which gives:

result = array([[14.91852875, 12.57726123, 16.76491873, 11.1383347 ],
       [12.74197493,  4.58371301, 13.84225144,  9.21726234]])

My Problem

I actually want these distance as a column in the dataframe:

                                                                                                                                      vec1  \
0       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   
1       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   
2       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   
3       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   
4  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   
5  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   
6  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   
7  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   

                  id  \
0  A4060454751516272   
1  A4060463723916275   
2  A4060454576016272   
3  A4050494816516277   
4  A4060454751516272   
5  A4060463723916275   
6  A4060454576016272   
7  A4050494816516277   

                                                                                                                                     vec2  
0    [2.0, 14.0, 9.0, 8.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 2.0, 14.0]  
1    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
2  [0.0, 9.0, 7.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 12.0, 15.0, 16.0]  
3   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 0.0, 4.0, 6.0, 15.0, 15.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
4    [2.0, 14.0, 9.0, 8.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 2.0, 14.0]  
5    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
6  [0.0, 9.0, 7.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 12.0, 15.0, 16.0]  
7   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 0.0, 4.0, 6.0, 15.0, 15.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

which is the dataframe of all combinations of vectors in df1 and df2. I could append result to that dataframe but I am unsure whether there can be missmatches (e.g., distance x between two vectors is actually the distance between two other vectors.

Greatful for any insight!

Data

df1 = {'vec1': {1623: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   1.5,
   0.0,
   8.0,
   13.0,
   0.0,
   0.0,
   0.0,
   5.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0],
  832: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.83,
   6.47,
   0.0,
   6.0,
   0.0,
   7.0,
   15.0,
   14.0,
   10.44,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0]}}

and

df2 = {'id': {774: 'A4060454751516272',
  6570: 'A4060463723916275',
  11466: 'A4060454576016272',
  7394: 'A4050494816516277'},
 'vec2': {774: [2.0,
   14.0,
   9.0,
   8.0,
   7.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   6.0,
   2.0,
   14.0],
  6570: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   12.0,
   15.5,
   6.0,
   3.5,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0],
  11466: [0.0,
   9.0,
   7.0,
   0.0,
   10.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   3.0,
   12.0,
   15.0,
   16.0],
  7394: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   11.0,
   0.0,
   4.0,
   6.0,
   15.0,
   15.0,
   7.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0]}}

CodePudding user response：

For clarity let's work with vector indices. Note that df1, df2 are dictionaries, let's create small dataframes with vector indices and ids(where available), and then, importantly, merge them using 'cross' method:

dfi1 = pd.DataFrame(df1).reset_index()[['index']]
dfi2 = pd.DataFrame(df2).reset_index()[['index', 'id']]
dfm = dfi1.merge(dfi2, how = 'cross')

dfm looks like this:

      index_x    index_y  id
--  ---------  ---------  -----------------
 0        832        774  A4060454751516272
 1        832       6570  A4060463723916275
 2        832      11466  A4060454576016272
 3        832       7394  A4050494816516277
 4       1623        774  A4060454751516272
 5       1623       6570  A4060463723916275
 6       1623      11466  A4060454576016272
 7       1623       7394  A4050494816516277

Now we can apply custom_distance to pairs of vectors as indexed by index_x and index_y

dfm['dist'] = dfm.apply(
    lambda r: custom_distance(np.array(df1['vec1'][r['index_x']]) ,
    np.array(df2['vec2'][r['index_y']]) ), axis=1
    )

Now dfm looks like this:

      index_x    index_y  id                     dist
--  ---------  ---------  -----------------  --------
 0        832        774  A4060454751516272  12.742
 1        832       6570  A4060463723916275   4.58371
 2        832      11466  A4060454576016272  13.8423
 3        832       7394  A4050494816516277   9.21726
 4       1623        774  A4060454751516272  14.9185
 5       1623       6570  A4060463723916275  12.5773
 6       1623      11466  A4060454576016272  16.7649
 7       1623       7394  A4050494816516277  11.1383

If you want to put the vectors into the same dataframe you can then do:

dfm['vec1'] = dfm['index_x'].map(df1['vec1'])
dfm['vec2'] = dfm['index_y'].map(df2['vec2'])
dfm.drop(columns = ['index_x', 'index_y'])