I have two dataframe df1 and df2. df1 consists of vectors of length 26 and df2 consists of both vectors of length 26 and unique ids. I provide dictionaries at the end of the question.
So, df1 is of the type:
vec1
1623 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
832 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
and df2 is given by:
id \
774 A4060454751516272
6570 A4060463723916275
11466 A4060454576016272
7394 A4050494816516277
vec2
774 [2.0, 14.0, 9.0, 8.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 2.0, 14.0]
6570 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
11466 [0.0, 9.0, 7.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 12.0, 15.0, 16.0]
7394 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 0.0, 4.0, 6.0, 15.0, 15.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
What I want to do is to compute the distances (custom distance) between all vectors in df1 and all vectors in df2, so there are 8 distances. To do so, I have this code:
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
def custom_distance(x,y):
d = np.count_nonzero(x)
dist = np.sqrt(np.sum((x - y)**2/d))
return dist
Ax = df1['vec1'].apply(lambda x: np.array(x))
Bx = df2['vec2'].apply(lambda x: np.array(x))
A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.stack(A)
BB = np.stack(B)
result = pairwise_distances(AA, BB, metric=custom_distance, n_jobs=-1)
which gives:
result = array([[14.91852875, 12.57726123, 16.76491873, 11.1383347 ],
[12.74197493, 4.58371301, 13.84225144, 9.21726234]])
My Problem
I actually want these distance as a column in the dataframe:
vec1 \
0 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
1 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
3 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 8.0, 13.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
4 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
5 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
6 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
7 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.83, 6.47, 0.0, 6.0, 0.0, 7.0, 15.0, 14.0, 10.44, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
id \
0 A4060454751516272
1 A4060463723916275
2 A4060454576016272
3 A4050494816516277
4 A4060454751516272
5 A4060463723916275
6 A4060454576016272
7 A4050494816516277
vec2
0 [2.0, 14.0, 9.0, 8.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 2.0, 14.0]
1 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2 [0.0, 9.0, 7.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 12.0, 15.0, 16.0]
3 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 0.0, 4.0, 6.0, 15.0, 15.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
4 [2.0, 14.0, 9.0, 8.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 2.0, 14.0]
5 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
6 [0.0, 9.0, 7.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 12.0, 15.0, 16.0]
7 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 0.0, 4.0, 6.0, 15.0, 15.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
which is the dataframe of all combinations of vectors in df1 and df2. I could append result to that dataframe but I am unsure whether there can be missmatches (e.g., distance x between two vectors is actually the distance between two other vectors.
Greatful for any insight!
Data
df1 = {'vec1': {1623: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
1.5,
0.0,
8.0,
13.0,
0.0,
0.0,
0.0,
5.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
832: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.83,
6.47,
0.0,
6.0,
0.0,
7.0,
15.0,
14.0,
10.44,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0]}}
and
df2 = {'id': {774: 'A4060454751516272',
6570: 'A4060463723916275',
11466: 'A4060454576016272',
7394: 'A4050494816516277'},
'vec2': {774: [2.0,
14.0,
9.0,
8.0,
7.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
6.0,
2.0,
14.0],
6570: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
12.0,
15.5,
6.0,
3.5,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
11466: [0.0,
9.0,
7.0,
0.0,
10.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.0,
12.0,
15.0,
16.0],
7394: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
11.0,
0.0,
4.0,
6.0,
15.0,
15.0,
7.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0]}}
CodePudding user response:
For clarity let's work with vector indices. Note that df1, df2 are dictionaries, let's create small dataframes with vector indices and ids(where available), and then, importantly, merge them using 'cross' method:
dfi1 = pd.DataFrame(df1).reset_index()[['index']]
dfi2 = pd.DataFrame(df2).reset_index()[['index', 'id']]
dfm = dfi1.merge(dfi2, how = 'cross')
dfm looks like this:
index_x index_y id
-- --------- --------- -----------------
0 832 774 A4060454751516272
1 832 6570 A4060463723916275
2 832 11466 A4060454576016272
3 832 7394 A4050494816516277
4 1623 774 A4060454751516272
5 1623 6570 A4060463723916275
6 1623 11466 A4060454576016272
7 1623 7394 A4050494816516277
Now we can apply custom_distance to pairs of vectors as indexed by index_x and index_y
dfm['dist'] = dfm.apply(
lambda r: custom_distance(np.array(df1['vec1'][r['index_x']]) ,
np.array(df2['vec2'][r['index_y']]) ), axis=1
)
Now dfm looks like this:
index_x index_y id dist
-- --------- --------- ----------------- --------
0 832 774 A4060454751516272 12.742
1 832 6570 A4060463723916275 4.58371
2 832 11466 A4060454576016272 13.8423
3 832 7394 A4050494816516277 9.21726
4 1623 774 A4060454751516272 14.9185
5 1623 6570 A4060463723916275 12.5773
6 1623 11466 A4060454576016272 16.7649
7 1623 7394 A4050494816516277 11.1383
If you want to put the vectors into the same dataframe you can then do:
dfm['vec1'] = dfm['index_x'].map(df1['vec1'])
dfm['vec2'] = dfm['index_y'].map(df2['vec2'])
dfm.drop(columns = ['index_x', 'index_y'])
