Compute date ranges with Nulls in Pandas-CodePudding

I have a dataframe that has several columns. I grouped it by 'Name' and sorted by 'Date1'. A subset of the dataset is below :-

data_ = {'Name': ['Alfie', 'Alfie', 'Alfie', 'Alfie','George','George','Radar','Radar','Radar','Radar'],'Date1': ['2009-03-26', '2014-12-23', '2016-03-18', '2021-08-31','2014-01-23','2020-02-20','2008-05-24','2010-06-22','2015-09-10','2017-03-11'], 'Date2': [np.NaN, '2016-03-16', np.NaN, np.NaN,np.NaN,np.NaN,'2010-03-23',np.NaN,'2016-09-12',np.NaN]
        , 'Difference': [0, 0, '2 days', 0,0,0,0,'91 days',0,'181 days']}
# Create DataFrame.
df_test = pd.DataFrame(data_)
df_test["Date1"] =  pd.to_datetime(df_test["Date1"],format="%Y-%m-%d")
df_test["Date2"] =  pd.to_datetime(df_test["Date2"],format="%Y-%m-%d")

I would like to compute the values in the 'Difference' column. It's the difference between the next 'Date1' value and a non-null 'Date2' as grouped by 'Name'. If 'Date2' is null, the Difference can just be set to 0. How can I make the computation in Pandas? Should it be like checking every row in the group by function, then comparing dates in the two columns?

CodePudding user response：

Here is my attempt, it is a bit ugly, but I think it works (except that the last difference is 180 days instead of 181):

import pandas as pd
import numpy as np


data_ = {'Name': ['Alfie', 'Alfie', 'Alfie', 'Alfie','George','George','Radar','Radar','Radar','Radar'],'Date1': ['2009-03-26', '2014-12-23', '2016-03-18', '2021-08-31','2014-01-23','2020-02-20','2008-05-24','2010-06-22','2015-09-10','2017-03-11'], 'Date2': [np.NaN, '2016-03-16', np.NaN, np.NaN,np.NaN,np.NaN,'2010-03-23',np.NaN,'2016-09-12',np.NaN]
        }
# Create DataFrame.
df_test = pd.DataFrame(data_)
df_test["Date1"] =  pd.to_datetime(df_test["Date1"],format="%Y-%m-%d")
df_test["Date2"] =  pd.to_datetime(df_test["Date2"],format="%Y-%m-%d")

difference_list = [0]

rows = list(df_test.iterrows())

for i in range(len(rows)-1):
    if not pd.isnull(rows[i][1]["Date2"]) and not pd.isnull(rows[i 1][1]["Date1"]): 
        difference_list.append((rows[i 1][1]["Date1"] - rows[i][1]["Date2"]).days)
    else:
        difference_list.append(0)

df_test["Difference"] = difference_list
print(df_test)

Essentially what I am doing is just putting all the rows into a list, and then doing the subtraction as you specified, where the (i 1)th Date1 is subtracted from the ith Date 2 to give the difference. The if and else are just to account for when there are null datetimes.

Output:

     Name      Date1      Date2  Difference
0   Alfie 2009-03-26        NaT           0
1   Alfie 2014-12-23 2016-03-16           0
2   Alfie 2016-03-18        NaT           2
3   Alfie 2021-08-31        NaT           0
4  George 2014-01-23        NaT           0
5  George 2020-02-20        NaT           0
6   Radar 2008-05-24 2010-03-23           0
7   Radar 2010-06-22        NaT          91
8   Radar 2015-09-10 2016-09-12           0
9   Radar 2017-03-11        NaT         180

CodePudding user response：

Try this:

s = df_test['Date1'].sub(df_test.groupby('Name')['Date2'].shift())
s.where(s.notna(),0)