Python dataframe find closest date for each ID-CodePudding

I have a dataframe like this:

data = {'SalePrice':[10,10,10,20,20,3,3,1,4,8,8],'HandoverDateA':['2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-03-30','2022-03-30'],'ID': ['Tom', 'Tom','Tom','Joseph','Joseph','Ben','Ben','Eden','Tim','Adam','Adam'], 'Tranche': ['Red', 'Red', 'Red', 'Red','Red','Blue','Blue','Red','Red','Red','Red'],'Totals':[100,100,100,50,50,90,90,70,60,70,70],'Sent':['2022-01-18','2022-02-19','2022-03-14','2022-03-14','2022-04-22','2022-03-03','2022-02-07','2022-01-04','2022-01-10','2022-01-15','2022-03-12'],'Amount':[20,10,14,34,15,60,25,10,10,40,20],'Opened':['2021-12-29','2021-12-29','2021-12-29','2022-12-29','2022-12-29','2021-12-19','2021-12-19','2021-12-29','2021-12-29','2021-12-29','2021-12-29']}

I need to find the sent date which is closest to the HandoverDate. I've seen plenty of examples that work when you give one date to search but here the date I want to be closest to can change for every ID. I have tried to adapt the following:

def nearest(items, pivot):
    return min([i for i in items if i <= pivot], key=lambda x: abs(x - pivot))

And also tried to write a loop where I make a dataframe for each ID and use max on the date column then stick them together, but it's incredibly slow!

Thanks for any suggestions :)

CodePudding user response：

IIUC, you can use:

data[['HandoverDateA', 'Sent']] = data[['HandoverDateA', 'Sent']].apply(pd.to_datetime)

out = data.loc[data['HandoverDateA']
               .sub(data['Sent']).abs()
               .groupby(data['ID']).idxmin()]

Output:

    SalePrice HandoverDateA      ID Tranche  Totals       Sent  Amount      Opened
10          8    2022-03-30    Adam     Red      70 2022-03-12      20  2021-12-29
5           3    2022-04-30     Ben    Blue      90 2022-03-03      60  2021-12-19
7           1    2022-04-30    Eden     Red      70 2022-01-04      10  2021-12-29
4          20    2022-04-30  Joseph     Red      50 2022-04-22      15  2022-12-29
8           4    2022-04-30     Tim     Red      60 2022-01-10      10  2021-12-29
2          10    2022-04-30     Tom     Red     100 2022-03-14      14  2021-12-29

CodePudding user response：

Considering that the goal is to find the sent date which is closest to the HandoverDate, one approach would be as follows.

First of all, create the dataframe df from data with pandas.DataFrame

import pandas as pd

df = pd.DataFrame(data)

Then, make sure that the columns HandoverDateA and Sent are of datetime using pandas.to_datetime

df['HandoverDateA'] = pd.to_datetime(df['HandoverDateA'])
df['Sent'] = pd.to_datetime(df['Sent'])

Then, in order to make it more convenient, create a column, diff, to store the absolute value of the difference between the columns HandoverDateA and Sent

df['diff'] = (df['HandoverDateA'] - df['Sent']).dt.days.abs()

With that column, one can simply sort by that column as follows

df = df.sort_values(by=['diff'])

[Out]:

    SalePrice HandoverDateA      ID  ... Amount      Opened diff
4          20    2022-04-30  Joseph  ...     15  2022-12-29    8
10          8    2022-03-30    Adam  ...     20  2021-12-29   18
2          10    2022-04-30     Tom  ...     14  2021-12-29   47
5           3    2022-04-30     Ben  ...     60  2021-12-19   58
8           4    2022-04-30     Tim  ...     10  2021-12-29  110
7           1    2022-04-30    Eden  ...     10  2021-12-29  116

and the first row is the one where Sent is closest to HandOverDateA.

With the column diff, one option to get the one where diff is minimum is with pandas.DataFrame.query as follows

df = df.query('diff == diff.min()')

[Out]:

   SalePrice HandoverDateA      ID Tranche  ...       Sent Amount      Opened diff
4         20    2022-04-30  Joseph     Red  ... 2022-04-22     15  2022-12-29    8

Notes:

For more information on sorting dataframes by columns, read my answer here.