I have 2 dataframes:
- df_exposure (>300k rows)
ID Limit Lat Lon
0 1 49 21.066107 121.930200
1 2 49 20.932773 121.913533
2 3 49 20.932773 121.921867
3 4 49 20.924440 121.930200
4 5 49 20.899440 121.905200
from df_exposure i extract
lat_loc = df_exposure.loc[:, 'Lat']
lon_loc = df_exposure.loc[:, 'Lon']
- df (3k rows):
Lat Lon Limit
0 4.125 116.125 0.0
1 4.375 116.125 0.0
2 4.625 116.125 0.0
3 4.875 116.125 0.0
4 5.125 116.125 0.0
And this is the haversine function:
def haversine(lat2, lon2, lat1, lon1):
lat1_ = lat1 * np.pi / 180
lat2_ = lat2 * np.pi / 180
lon1_ = lon1 * np.pi / 180
lon2_ = lon2 * np.pi / 180
a = (np.sin((lat2_ - lat1_) / 2)**2) (np.sin((lon2_ - lon1_) / 2)**2) * np.cos(lat1_) * np.cos(lat2_)
dist = 2 * 6371 * np.arcsin(np.sqrt(a))
return dist
Essentially, the df is a subset of df_exposure with bigger grid size and I would like to get the get the distance between all locations in df against each location (row) of lat long in df_exposure to find the minimum distance and allocate the Limit in the corresponding df_exposure row to location in df with smallest distance and this will be iterative on each location in df_exposure until all locations are calculated.
This is how currently it is done but it takes a very long time because of the size of df_exposure (>300k rows)
for i in range(len(lat_loc)):
r = haversine(df.loc[:, 'Lat'], df.loc[:, 'Lon'], lat_loc[i], lon_loc[i])
dist = r.min() # find minimum distance
df.loc[list(r).index(dist), 'Limit'] = df.loc[list(r).index(dist), 'Limit'] df_exposure.loc[i, 'Limit']
I would appreciate some suggestion to improve the current code. Thank you.
CodePudding user response:
You can use sklearn.neighbors.DistanceMetric for the haversine distance,
from sklearn.neighbors import DistanceMetric
distance = DistanceMetric.get_metric('haversine')
lat1 = df_exposure.loc[:, 'Lat']
lon1 = df_exposure.loc[:, 'Lon']
lat2 = df.loc[:, 'Lat']
lon2 = df.loc[:, 'Lon']
(6371*distance.pairwise((np.array([lat1,lon1])* np.pi / 180).T,
(np.array([lat2,lon2])* np.pi / 180).T).min(1))
CodePudding user response:
Let's go in order. I have created dataframes with the specified dimensions. Here is the runtime of your implementation:
import time
import numpy as np
import pandas as pd
EXPOSURE_SIZE = 300_000
DF_SIZE = 3000
df_exposure = pd.DataFrame({'Limit': np.random.randint(0, 1000, size=(EXPOSURE_SIZE,)),
'Lat': np.random.uniform(-10, 10, size=EXPOSURE_SIZE),
'Lon': np.random.uniform(-10, 10, size=EXPOSURE_SIZE)})
df = pd.DataFrame(
{'Limit': np.random.randint(0, 1000, size=(DF_SIZE,)),
'Lat': np.random.uniform(-10, 10, size=DF_SIZE),
'Lon': np.random.uniform(-10, 10, size=DF_SIZE)})
def haversine(lat2, lon2, lat1, lon1):
lat1_ = lat1 * np.pi / 180
lat2_ = lat2 * np.pi / 180
lon1_ = lon1 * np.pi / 180
lon2_ = lon2 * np.pi / 180
a = (np.sin((lat2_ - lat1_) / 2) ** 2) (np.sin((lon2_ - lon1_) / 2) ** 2) * np.cos(lat1_) * np.cos(lat2_)
dist = 2 * 6371 * np.arcsin(np.sqrt(a))
return dist
if __name__ == '__main__':
lat_loc = df_exposure.loc[:, 'Lat']
lon_loc = df_exposure.loc[:, 'Lon']
start = time.monotonic()
for i in range(len(lat_loc)):
r = haversine(df.loc[:, 'Lat'], df.loc[:, 'Lon'], lat_loc[i], lon_loc[i])
dist = r.min() # find minimum distance
df.loc[list(r).index(dist), 'Limit'] = df.loc[list(r).index(dist), 'Limit'] df_exposure.loc[i, 'Limit']
print(f'with for loop and series time took: {time.monotonic() - start:.1f} s.')
Out:
with for loop and series time took: 456.3 s.
You should understand that in this case you are passing lat and lon as a pd.Series to the haversine function. With that, your function is vectorized. Let's pass the coordinates as numpy arrays
lon = df['Lon'].values
lat = df['Lat'].values
start = time.monotonic()
for i in range(len(lat_loc)):
r = haversine(lat, lon, lat_loc[i], lon_loc[i])
df.iloc[np.argmin(r), 0] = df.iloc[np.argmin(r), 0] df_exposure.iloc[i, 0]
print(f'with for loop and ndarray time took: {time.monotonic() - start:.1f} s.')
Out:
with for loop and ndarray time took: 62.4 s.
Wow! Speedup is ~ 7 time.
Let's try to use V.M answer and use the DistanceMetric class from sklearn.metrics module:
from sklearn.metrics import DistanceMetric
distance = DistanceMetric.get_metric('haversine')
lat1 = df_exposure.loc[:, 'Lat']
lon1 = df_exposure.loc[:, 'Lon']
lat2 = df.loc[:, 'Lat']
lon2 = df.loc[:, 'Lon']
start = time.monotonic()
res = (6371 * distance.pairwise((np.array([lat1, lon1]) * np.pi / 180).T,
(np.array([lat2, lon2]) * np.pi / 180).T)).argmin(axis=1)
print(f'with sklearn pairwise distance time took: {time.monotonic() - start:.1f} s.')
Out:
with sklearn pairwise distance time took: 45.6 s.
Even better! Speed-up is ~ 10 time
But what if the logic inside the loop is moved to a new function and the apply methods are used?
def foo(row, lat, lon):
"""
row: row of DataFrame
lat: ndarray with latitude
lon: ndarray with longitude
"""
r = haversine(lat, lon, row[1], row[2])
return r.argmin()
start = time.monotonic()
res = df_exposure.apply(foo, raw=True, axis=1, args=(lat, lon))
print(f'synchronous apply time took: {time.monotonic() - start:.1f} s.')
Out:
synchronous apply time took: 32.4 s.
Wow! It's even faster.
Can we speed up our calculations even more? Yes! If we remember that pandas always runs on one core of your CPU. We need to parallelize the best of the received ways. This can easily be done with parallel-pandas
#pip install parallel-pandas
from parallel_pandas import ParallelPandas
#initialize parallel-pandas
ParallelPandas.initialize(disable_pr_bar=True)
#p_apply is a parallel analog of apply method
start = time.monotonic()
res = df_exposure.p_apply(foo, raw=True, axis=1, args=(lat, lon))
print(f'parallel apply time took: {time.monotonic() - start:.1f} s.')
Out:
parallel apply time took: 3.7
This is amazing!
Total speed-up 456/3.7 ~ 120
