I have a dataframe in Pandas that contains postcode data as below
| postcode | lat | long | coordinates | district |
|---|---|---|---|---|
| AB1 0AA | 57.101474 | -2.242851 | (57.101474 -2.242851) | Aberdeen City |
| AB1 0AB | 57.102554 | -2.246308 | (57.102554 -2.246308) | Aberdeen City |
| AB1 0AD | 57.100556 | -2.248342 | (57.100556 -2.248342) | Aberdeen City |
| AB1 0AR | 57.091357 | -2.224831 | (57.091357 -2.224831) | Aberdeenshire |
| AB1 0AS | 57.083838 | -2.234437 | (57.083838 -2.234437) | Aberdeenshire |
| AB1 0AT | 57.089299 | -2.239768 | (57.089299 -2.239768) | Aberdeenshire |
I would like to find out the two postcodes by district that are furthest apart. I know I can use the haversine library to calculate the distance between two sets of coordinates:
import haversine as hs
dist=hs.haversine(coordinates1,coordinates2)
But how would calculate across all combinations of postcodes within a district group to find the 2 postcodes furthest apart by district ?
Edit: The dataframe has 2656000 rows.
CodePudding user response:
I note that in your example df, 'district' and 'coordinates' seem to be flipped. Code below uses your df as is
The steps are to groupby the 'coordinates' (Aberdeenshire etc) and then do a cross merge on itself, creating all possible pairs of postcodes, apply the distance function and take rows with the max value for each group
df1 = df.groupby('coordinates').apply(lambda g: g.merge(g, how = 'cross'))
df1['dist'] = df1.apply(lambda r: hs.haversine((r['lat_x'], r['long_x']),((r['lat_y'], r['long_y']))), axis=1)
df1.sort_values('dist', ascending = False).groupby('coordinates_x').head(1).reset_index(drop = True)
output
postcode_x lat_x long_x district_x coordinates_x postcode_y lat_y long_y district_y coordinates_y dist
-- ------------ ------- -------- --------------------- --------------- ------------ ------- -------- --------------------- --------------- --------
0 AB1 0AR 57.0914 -2.22483 (57.091357 -2.224831) Aberdeenshire AB1 0AS 57.0838 -2.23444 (57.083838 -2.234437) Aberdeenshire 1.01777
1 AB1 0AA 57.1015 -2.24285 (57.101474 -2.242851) Aberdeen City AB1 0AD 57.1006 -2.24834 (57.100556 -2.248342) Aberdeen City 0.346992
CodePudding user response:
You can do this:
Create a copy of you dataframe
df1 = df
df2 = df
temp = df1.assign(A=1).merge(df2.assign(A=1), on='A').drop('A', 1)
define the Haersine function:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
if to_radians:
lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
a = np.sin((lat2-lat1)/2.0)**2 \
np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
return earth_radius * 2 * np.arcsin(np.sqrt(a))
and calculated all pairwise distances:
temp['distances'] = temp.apply(lambda x: haversine(x['lat_x'], x['long_x'],x['lat_y'], x['long_y']), 1)
which gives:
postcode_x lat_x long_x district_x coordinates_x \
0 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
1 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
2 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
3 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
4 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
5 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
6 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
7 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
8 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
9 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
10 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
11 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
12 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
13 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
14 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
15 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
16 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
17 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
18 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
19 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
20 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
21 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
22 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
23 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
24 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
25 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
26 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
27 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
28 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
29 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
30 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
31 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
32 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
33 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
34 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
35 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
postcode_y lat_y long_y district_y coordinates_y \
0 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
1 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
2 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
3 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
4 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
5 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
6 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
7 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
8 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
9 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
10 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
11 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
12 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
13 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
14 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
15 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
16 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
17 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
18 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
19 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
20 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
21 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
22 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
23 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
24 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
25 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
26 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
27 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
28 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
29 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
30 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
31 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
32 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
33 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
34 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
35 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
distances
0 0.000000
1 0.240859
2 0.346992
3 1.565351
4 2.025836
5 1.366547
6 0.240859
7 0.000000
8 0.253869
9 1.798078
10 2.201213
11 1.525913
12 0.346992
13 0.253869
14 0.000000
15 1.750198
16 2.039937
17 1.354641
18 1.565351
19 1.798078
20 1.750198
21 0.000000
22 1.017773
23 0.930967
24 2.025836
25 2.201213
26 2.039937
27 1.017773
28 0.000000
29 0.687374
30 1.366547
31 1.525913
32 1.354641
33 0.930967
34 0.687374
35 0.000000
CodePudding user response:
Updating my answer after finding the dataset that I assume you're working with (full list of UK postcodes). It looks like the largest district by number of postcodes is Birmingham with 34,459 postcodes. A 34,459 x 34,459 distance matrix is right on the edge of what my 16GB RAM machine was able to handle without an out of memory error, but it got there eventually.
The scikit-learn vectorised implementation of the haversine distance formula seems a lot faster than the one I posted in my previous solution, so I have edited the dm = line in the function I posted previously. I also added a print statement showing the district currently being worked on so you can get an idea of progress. If you make it past Birmingham, it should run to completion.
Full code which gave me the desired solution is as follows:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
def get_furthest_in_district(df):
print(df['district'].iloc[0], len(df))
dm = haversine_distances(np.deg2rad(df[['lat', 'long']])) * 6371
idx_1, idx_2 = np.unravel_index(np.argmax(dm), dm.shape)
postcode_1 = df['postcode'].iloc[idx_1]
postcode_2 = df['postcode'].iloc[idx_2]
distance = dm[idx_1, idx_2]
return pd.Series(
data=[postcode_1, postcode_2, distance],
index=['postcode_1', 'postcode_2', 'distance']
)
results = df.groupby('district').apply(get_furthest_in_district)
Giving a final DataFrame of 374 rows (one for each district), with the first 5 rows as follows:
postcode_1 postcode_2 distance
district
Aberdeen City AB23 8BS AB31 3AS 20.252278
Aberdeenshire AB3 5YB AB43 8WA 133.891028
Adur BN14 9JU BN4 1PY 9.932842
Allerdale CA12 4TP CA7 5BP 50.893348
Amber Valley DE55 7EG DE6 5BG 23.678767
