Python - haversine distance between 2 dataframes and allocate the corresponding Limit to loc with mi-CodePudding

I have 2 dataframes:

df_exposure (>300k rows)

    ID  Limit   Lat Lon
0   1   49  21.066107   121.930200
1   2   49  20.932773   121.913533
2   3   49  20.932773   121.921867
3   4   49  20.924440   121.930200
4   5   49  20.899440   121.905200

from df_exposure i extract

lat_loc = df_exposure.loc[:, 'Lat']
lon_loc = df_exposure.loc[:, 'Lon']

df (3k rows):


Lat Lon Limit
0   4.125   116.125 0.0
1   4.375   116.125 0.0
2   4.625   116.125 0.0
3   4.875   116.125 0.0
4   5.125   116.125 0.0

And this is the haversine function:

def haversine(lat2, lon2, lat1, lon1):
    
    lat1_ = lat1 * np.pi / 180
    lat2_ = lat2 * np.pi / 180
    lon1_ = lon1 * np.pi / 180
    lon2_ = lon2 * np.pi / 180
     
    a = (np.sin((lat2_ - lat1_) / 2)**2)   (np.sin((lon2_ - lon1_) / 2)**2) * np.cos(lat1_) * np.cos(lat2_)
    dist = 2 * 6371 * np.arcsin(np.sqrt(a))

    return dist

Essentially, the df is a subset of df_exposure with bigger grid size and I would like to get the get the distance between all locations in df against each location (row) of lat long in df_exposure to find the minimum distance and allocate the Limit in the corresponding df_exposure row to location in df with smallest distance and this will be iterative on each location in df_exposure until all locations are calculated.

This is how currently it is done but it takes a very long time because of the size of df_exposure (>300k rows)

for i in range(len(lat_loc)):

    r = haversine(df.loc[:, 'Lat'], df.loc[:, 'Lon'], lat_loc[i], lon_loc[i])
    dist = r.min() # find minimum distance
    df.loc[list(r).index(dist), 'Limit'] = df.loc[list(r).index(dist), 'Limit']   df_exposure.loc[i, 'Limit']

I would appreciate some suggestion to improve the current code. Thank you.

CodePudding user response：

You can use sklearn.neighbors.DistanceMetric for the haversine distance,

from sklearn.neighbors import DistanceMetric
distance = DistanceMetric.get_metric('haversine')

lat1 = df_exposure.loc[:, 'Lat']
lon1 = df_exposure.loc[:, 'Lon']

lat2 = df.loc[:, 'Lat']
lon2 = df.loc[:, 'Lon']

(6371*distance.pairwise((np.array([lat1,lon1])* np.pi / 180).T, 
                    (np.array([lat2,lon2])* np.pi / 180).T).min(1))

CodePudding user response：

Let's go in order. I have created dataframes with the specified dimensions. Here is the runtime of your implementation:

import time

import numpy as np
import pandas as pd


EXPOSURE_SIZE = 300_000
DF_SIZE = 3000

df_exposure = pd.DataFrame({'Limit': np.random.randint(0, 1000, size=(EXPOSURE_SIZE,)),
                            'Lat': np.random.uniform(-10, 10, size=EXPOSURE_SIZE),
                            'Lon': np.random.uniform(-10, 10, size=EXPOSURE_SIZE)})

df = pd.DataFrame(
    {'Limit': np.random.randint(0, 1000, size=(DF_SIZE,)),
     'Lat': np.random.uniform(-10, 10, size=DF_SIZE),
     'Lon': np.random.uniform(-10, 10, size=DF_SIZE)})


def haversine(lat2, lon2, lat1, lon1):
    lat1_ = lat1 * np.pi / 180
    lat2_ = lat2 * np.pi / 180
    lon1_ = lon1 * np.pi / 180
    lon2_ = lon2 * np.pi / 180

    a = (np.sin((lat2_ - lat1_) / 2) ** 2)   (np.sin((lon2_ - lon1_) / 2) ** 2) * np.cos(lat1_) * np.cos(lat2_)
    dist = 2 * 6371 * np.arcsin(np.sqrt(a))

    return dist


if __name__ == '__main__':
    lat_loc = df_exposure.loc[:, 'Lat']
    lon_loc = df_exposure.loc[:, 'Lon']

    start = time.monotonic()
    for i in range(len(lat_loc)):
        r = haversine(df.loc[:, 'Lat'], df.loc[:, 'Lon'], lat_loc[i], lon_loc[i])
        dist = r.min()  # find minimum distance
        df.loc[list(r).index(dist), 'Limit'] = df.loc[list(r).index(dist), 'Limit']   df_exposure.loc[i, 'Limit']
    print(f'with for loop and series time took: {time.monotonic() - start:.1f} s.')

Out:
     with for loop and series time took: 456.3 s.

You should understand that in this case you are passing lat and lon as a pd.Series to the haversine function. With that, your function is vectorized. Let's pass the coordinates as numpy arrays

lon = df['Lon'].values
lat = df['Lat'].values

start = time.monotonic()
    for i in range(len(lat_loc)):
        r = haversine(lat, lon, lat_loc[i], lon_loc[i])
        df.iloc[np.argmin(r), 0] = df.iloc[np.argmin(r), 0]   df_exposure.iloc[i, 0]
    print(f'with for loop and ndarray time took: {time.monotonic() - start:.1f} s.')

Out:
     with for loop and ndarray time took: 62.4 s.

Wow! Speedup is ~ 7 time.

Let's try to use V.M answer and use the DistanceMetric class from sklearn.metrics module:

from sklearn.metrics import DistanceMetric

distance = DistanceMetric.get_metric('haversine')

lat1 = df_exposure.loc[:, 'Lat']
lon1 = df_exposure.loc[:, 'Lon']

lat2 = df.loc[:, 'Lat']
lon2 = df.loc[:, 'Lon']
start = time.monotonic()
res = (6371 * distance.pairwise((np.array([lat1, lon1]) * np.pi / 180).T,
                                     (np.array([lat2, lon2]) * np.pi / 180).T)).argmin(axis=1)
print(f'with sklearn pairwise distance time took: {time.monotonic() - start:.1f} s.')

Out: 
    with sklearn pairwise distance time took: 45.6 s.

Even better! Speed-up is ~ 10 time

But what if the logic inside the loop is moved to a new function and the apply methods are used?

def foo(row, lat, lon):
    """
    row: row of DataFrame
    lat: ndarray with latitude
    lon: ndarray with longitude
    """
    r = haversine(lat, lon, row[1], row[2])
    return r.argmin()

start = time.monotonic()
res = df_exposure.apply(foo, raw=True, axis=1, args=(lat, lon))
print(f'synchronous apply time took: {time.monotonic() - start:.1f} s.')

Out:
    synchronous apply time took: 32.4 s.

Wow! It's even faster.

Can we speed up our calculations even more? Yes! If we remember that pandas always runs on one core of your CPU. We need to parallelize the best of the received ways. This can easily be done with parallel-pandas

#pip install parallel-pandas
from parallel_pandas import ParallelPandas

#initialize parallel-pandas
ParallelPandas.initialize(disable_pr_bar=True)

#p_apply is a parallel analog of apply method
start = time.monotonic()
res = df_exposure.p_apply(foo, raw=True, axis=1, args=(lat, lon))
print(f'parallel apply time took: {time.monotonic() - start:.1f} s.')

Out: 
     parallel apply time took: 3.7

This is amazing! Total speed-up 456/3.7 ~ 120