Select closest row by coordinates between 2 tab in pandas-CodePudding

I have a dataframe such as :

Tab1

    COL1  start1  end1 COL3
    A     140     180  LO
    B     1000    2000 LA

and another dataframe such as :

Tab2

COL1   start2 end2 COL4
A      3     7     PM
A      10    90    UI
A      80    123   RT
B      0     30    ZA
B      2100  2400  AZ

And I would like to create a file such as :

COL1  start1  end1  start2 end2 COL3 COL4 Distance
A     140     180   80     123  LO   RT   17
B     1000    2000  2100   2400 LA   AZ   100

The idea here is for each COL1 in Tab1, I want to look at the same in Tab2, and look the closest start2-end2 position compared to start1-end1 position.

So for instance in COL1-A:

Tab1

COL1  start1  end1 COL3
A     140     180  LO

I have 3 row on Tab2:

COL1   start2 end2 COL4
A      3     7     PM
A      10    90    UI
A      80    123   RT

As you can see the closest start2-end2 to 140-180 is 80-123 since : 140-123 = 17 , 140-90= 50 and 140-7 = 133. (17 is the shortest and then the closest). So I add it in a New_tab:

COL1  start1  end1  start2 end2 COL3 COL4 Distance
A     140     180   80     123  LO   RT   17

Here all coordinates were less than start1-end1, but I can have also start2-end2 > start1-end1 such as in example B:

COL1  start1  end1 COL3
B     1000    2000 LA

here in Tab2 I have 2 candidates:

COL1   start2 end2 COL4
B      0     30    ZA
B      2100  2400  AZ

Where the closest is 2100-2400 since

2100-2000 = 100 and 1000-30 = 970 (100 in the shortest, then the closest). So I right in the New_tab:

COL1  start1  end1  start2 end2 COL3 COL4 Distance
A     140     180   80     123  LO   RT   17
B     1000    2000  2100   2400 LA   AZ   100

So far I succeed to do that using that uggly long code :

import pandas as pd 

tab1=pd.read_csv("tab1.txt",sep=";")
tab2=pd.read_csv("tab2.txt",sep=";")

New_tab = pd.DataFrame(columns=['COL1',  'start1','end1','start2','end2','COL3','COL4','Distance'])

for letters in tab1['COL1'].unique():
    #Extract subtab for each species 
    #Loop over subBUSCO_table coordinates 
    for index, row in tab1.iterrows():
                subtab2=tab2.loc[tab2['COL1'].str.contains(row['COL1'])]
                if len(subtab2)>0:
                    COL1= row['COL1']
                    COL3= row['COL3']
                    tab1_start = row['start1']
                    tab1_end = row['end1']
                    #Left windows 
                    if len(subtab2.loc[subtab2['end2'].le(tab1_start) & subtab2['start2'].lt(tab1_start)]) >0 :
                        subsubtab2=subtab2.loc[subtab2['end2'].le(tab1_start) & subtab2['start2'].lt(tab1_start)]
                        subsubtab2=subsubtab2.sort_values(by='end2', ascending=False)
                        Distance=tab1_start - subsubtab2.iloc[0]['end2']
                        tab2_start=subsubtab2.iloc[0]['start2']
                        COL4=subsubtab2.iloc[0]['COL4']
                        tab2_end=subsubtab2.iloc[0]['end2']
                        New_tab = New_tab.append({'COL1':COL1,'start1': tab1_start, 'end1':tab1_end,'start2': tab2_start, 'end2':tab2_end,'COL3':COL3, 'COL4':COL4, 'Distance':Distance},ignore_index=True)
                    #Right windows 
                    if len(subtab2.loc[subtab2['end2'].gt(tab1_end) & subtab2['start2'].ge(tab1_end)]) >0 :
                        subsubtab2=subtab2.loc[subtab2['end2'].gt(tab1_end) & subtab2['start2'].ge(tab1_end)]
                        subsubtab2=subsubtab2.sort_values(by='end2', ascending=False)
                        Distance= subsubtab2.iloc[0]['start2'] - tab1_end
                        tab2_start=subsubtab2.iloc[0]['start2']
                        COL4=subsubtab2.iloc[0]['COL4']
                        tab2_end=subsubtab2.iloc[0]['end2']
                        print("right")
                        New_tab = New_tab.append({'COL1':COL1,'start1': tab1_start, 'end1':tab1_end,'start2': tab2_start, 'end2':tab2_end,'COL3':COL3, 'COL4':COL4, 'Distance':Distance},ignore_index=True)

#Sort df Distance 
New_tab=New_tab.sort_values(by='Distance', ascending=True)
#Keep shortest per COL1 
New_tab = New_tab.drop_duplicates(subset=['COL1'], keep='first')

New_tab
  COL1 start1  end1 start2  end2 COL3 COL4 Distance
0    A    140   180     80   123   LO   RT       17
2    B   1000  2000   2100  2400  LA    AZ      100

Here are the two table in dict format if it can helps:

>>> tab1.to_dict()
{'COL1': {0: 'A', 1: 'B'}, 'start1': {0: 140, 1: 1000}, 'end1': {0: 180, 1: 2000}, 'COL3': {0: 'LO', 1: 'LA '}}
>>> tab2.to_dict()
{'COL1': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'}, 'start2': {0: 3, 1: 10, 2: 80, 3: 0, 4: 2100}, 'end2': {0: 7, 1: 90, 2: 123, 3: 30, 4: 2400}, 'COL4': {0: 'PM', 1: 'UI', 2: 'RT', 3: 'ZA', 4: 'AZ'}}

CodePudding user response：

Here's one way:

(i) Merge the 2 DataFrames on 'COL1'

(ii) Find the absolute difference between (end1-start1) and (end2-start2) and assign it to new column 'diff'

(iii) Find Distance using np.where and assign it to new column 'Distance'

(iv) Transform the minimum diff and use it as a boolean mask to filter the correct rows

new_tab = tab1.merge(tab2, on='COL1')
new_tab['diff'] = ((new_tab['end1'] - new_tab['start1']) - (new_tab['end2'] - new_tab['start2'])).abs()
first = new_tab['end2'] - new_tab['start1']
second = new_tab['end1'] - new_tab['start2']
new_tab['Distance'] = np.abs(np.where(first>second, second, first))
out = new_tab[new_tab['diff'] == new_tab.groupby('COL1')['diff'].transform('min')].drop('diff', axis=1)

Output:

  COL1  start1  end1 COL3  start2  end2 COL4  Distance
2    A     140   180   LO      80   123   RT        17
4    B    1000  2000  LA     2100  2400   AZ       100