Home > Enterprise >  Pandas Dataframe Find the two postcodes furthest apart by district
Pandas Dataframe Find the two postcodes furthest apart by district

Time:01-30

I have a dataframe in Pandas that contains postcode data as below

postcode lat long coordinates district
AB1 0AA 57.101474 -2.242851 (57.101474 -2.242851) Aberdeen City
AB1 0AB 57.102554 -2.246308 (57.102554 -2.246308) Aberdeen City
AB1 0AD 57.100556 -2.248342 (57.100556 -2.248342) Aberdeen City
AB1 0AR 57.091357 -2.224831 (57.091357 -2.224831) Aberdeenshire
AB1 0AS 57.083838 -2.234437 (57.083838 -2.234437) Aberdeenshire
AB1 0AT 57.089299 -2.239768 (57.089299 -2.239768) Aberdeenshire

I would like to find out the two postcodes by district that are furthest apart. I know I can use the haversine library to calculate the distance between two sets of coordinates:

import haversine as hs
dist=hs.haversine(coordinates1,coordinates2)

But how would calculate across all combinations of postcodes within a district group to find the 2 postcodes furthest apart by district ?

Edit: The dataframe has 2656000 rows.

CodePudding user response:

I note that in your example df, 'district' and 'coordinates' seem to be flipped. Code below uses your df as is

The steps are to groupby the 'coordinates' (Aberdeenshire etc) and then do a cross merge on itself, creating all possible pairs of postcodes, apply the distance function and take rows with the max value for each group

df1 = df.groupby('coordinates').apply(lambda g: g.merge(g, how = 'cross'))
df1['dist'] = df1.apply(lambda r: hs.haversine((r['lat_x'], r['long_x']),((r['lat_y'], r['long_y']))), axis=1)
df1.sort_values('dist', ascending = False).groupby('coordinates_x').head(1).reset_index(drop = True)

output

    postcode_x      lat_x    long_x  district_x             coordinates_x    postcode_y      lat_y    long_y  district_y             coordinates_y        dist
--  ------------  -------  --------  ---------------------  ---------------  ------------  -------  --------  ---------------------  ---------------  --------
 0  AB1 0AR       57.0914  -2.22483  (57.091357 -2.224831)  Aberdeenshire    AB1 0AS       57.0838  -2.23444  (57.083838 -2.234437)  Aberdeenshire    1.01777
 1  AB1 0AA       57.1015  -2.24285  (57.101474 -2.242851)  Aberdeen City    AB1 0AD       57.1006  -2.24834  (57.100556 -2.248342)  Aberdeen City    0.346992

CodePudding user response:

You can do this:

Create a copy of you dataframe

df1 = df
df2 = df
temp = df1.assign(A=1).merge(df2.assign(A=1), on='A').drop('A', 1) 

define the Haersine function:

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2   \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

and calculated all pairwise distances:

temp['distances'] = temp.apply(lambda x: haversine(x['lat_x'], x['long_x'],x['lat_y'], x['long_y']), 1)

which gives:


postcode_x      lat_x    long_x              district_x  coordinates_x  \
0     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
1     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
2     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
3     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
4     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
5     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
6     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
7     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
8     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
9     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
10    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
11    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
12    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
13    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
14    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
15    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
16    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
17    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
18    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
19    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
20    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
21    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
22    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
23    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
24    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
25    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
26    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
27    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
28    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
29    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
30    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
31    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
32    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
33    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
34    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
35    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   

   postcode_y      lat_y    long_y              district_y  coordinates_y  \
0     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
1     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
2     AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
3     AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
4     AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
5     AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
6     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
7     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
8     AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
9     AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
10    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
11    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
12    AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
13    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
14    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
15    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
16    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
17    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
18    AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
19    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
20    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
21    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
22    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
23    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
24    AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
25    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
26    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
27    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
28    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
29    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
30    AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
31    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
32    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
33    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
34    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
35    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   

    distances  
0    0.000000  
1    0.240859  
2    0.346992  
3    1.565351  
4    2.025836  
5    1.366547  
6    0.240859  
7    0.000000  
8    0.253869  
9    1.798078  
10   2.201213  
11   1.525913  
12   0.346992  
13   0.253869  
14   0.000000  
15   1.750198  
16   2.039937  
17   1.354641  
18   1.565351  
19   1.798078  
20   1.750198  
21   0.000000  
22   1.017773  
23   0.930967  
24   2.025836  
25   2.201213  
26   2.039937  
27   1.017773  
28   0.000000  
29   0.687374  
30   1.366547  
31   1.525913  
32   1.354641  
33   0.930967  
34   0.687374  
35   0.000000  

CodePudding user response:

Updating my answer after finding the dataset that I assume you're working with (full list of UK postcodes). It looks like the largest district by number of postcodes is Birmingham with 34,459 postcodes. A 34,459 x 34,459 distance matrix is right on the edge of what my 16GB RAM machine was able to handle without an out of memory error, but it got there eventually.

The scikit-learn vectorised implementation of the haversine distance formula seems a lot faster than the one I posted in my previous solution, so I have edited the dm = line in the function I posted previously. I also added a print statement showing the district currently being worked on so you can get an idea of progress. If you make it past Birmingham, it should run to completion.

Full code which gave me the desired solution is as follows:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import haversine_distances


def get_furthest_in_district(df):
    print(df['district'].iloc[0], len(df))
    dm = haversine_distances(np.deg2rad(df[['lat', 'long']])) * 6371
    idx_1, idx_2 = np.unravel_index(np.argmax(dm), dm.shape)
    postcode_1 = df['postcode'].iloc[idx_1]
    postcode_2 = df['postcode'].iloc[idx_2]
    distance = dm[idx_1, idx_2]
    return pd.Series(
        data=[postcode_1, postcode_2, distance],
        index=['postcode_1', 'postcode_2', 'distance']
    )


results = df.groupby('district').apply(get_furthest_in_district)

Giving a final DataFrame of 374 rows (one for each district), with the first 5 rows as follows:

              postcode_1 postcode_2    distance
district                                       
Aberdeen City   AB23 8BS   AB31 3AS   20.252278
Aberdeenshire    AB3 5YB   AB43 8WA  133.891028
Adur            BN14 9JU    BN4 1PY    9.932842
Allerdale       CA12 4TP    CA7 5BP   50.893348
Amber Valley    DE55 7EG    DE6 5BG   23.678767
  •  Tags:  
  • Related