Home > database >  Compute date ranges with Nulls in Pandas
Compute date ranges with Nulls in Pandas

Time:01-25

I have a dataframe that has several columns. I grouped it by 'Name' and sorted by 'Date1'. A subset of the dataset is below :-

data_ = {'Name': ['Alfie', 'Alfie', 'Alfie', 'Alfie','George','George','Radar','Radar','Radar','Radar'],'Date1': ['2009-03-26', '2014-12-23', '2016-03-18', '2021-08-31','2014-01-23','2020-02-20','2008-05-24','2010-06-22','2015-09-10','2017-03-11'], 'Date2': [np.NaN, '2016-03-16', np.NaN, np.NaN,np.NaN,np.NaN,'2010-03-23',np.NaN,'2016-09-12',np.NaN]
        , 'Difference': [0, 0, '2 days', 0,0,0,0,'91 days',0,'181 days']}
# Create DataFrame.
df_test = pd.DataFrame(data_)
df_test["Date1"] =  pd.to_datetime(df_test["Date1"],format="%Y-%m-%d")
df_test["Date2"] =  pd.to_datetime(df_test["Date2"],format="%Y-%m-%d")

I would like to compute the values in the 'Difference' column. It's the difference between the next 'Date1' value and a non-null 'Date2' as grouped by 'Name'. If 'Date2' is null, the Difference can just be set to 0. How can I make the computation in Pandas? Should it be like checking every row in the group by function, then comparing dates in the two columns?

CodePudding user response:

Here is my attempt, it is a bit ugly, but I think it works (except that the last difference is 180 days instead of 181):

import pandas as pd
import numpy as np


data_ = {'Name': ['Alfie', 'Alfie', 'Alfie', 'Alfie','George','George','Radar','Radar','Radar','Radar'],'Date1': ['2009-03-26', '2014-12-23', '2016-03-18', '2021-08-31','2014-01-23','2020-02-20','2008-05-24','2010-06-22','2015-09-10','2017-03-11'], 'Date2': [np.NaN, '2016-03-16', np.NaN, np.NaN,np.NaN,np.NaN,'2010-03-23',np.NaN,'2016-09-12',np.NaN]
        }
# Create DataFrame.
df_test = pd.DataFrame(data_)
df_test["Date1"] =  pd.to_datetime(df_test["Date1"],format="%Y-%m-%d")
df_test["Date2"] =  pd.to_datetime(df_test["Date2"],format="%Y-%m-%d")

difference_list = [0]

rows = list(df_test.iterrows())

for i in range(len(rows)-1):
    if not pd.isnull(rows[i][1]["Date2"]) and not pd.isnull(rows[i 1][1]["Date1"]): 
        difference_list.append((rows[i 1][1]["Date1"] - rows[i][1]["Date2"]).days)
    else:
        difference_list.append(0)

df_test["Difference"] = difference_list
print(df_test)

Essentially what I am doing is just putting all the rows into a list, and then doing the subtraction as you specified, where the (i 1)th Date1 is subtracted from the ith Date 2 to give the difference. The if and else are just to account for when there are null datetimes.

Output:

     Name      Date1      Date2  Difference
0   Alfie 2009-03-26        NaT           0
1   Alfie 2014-12-23 2016-03-16           0
2   Alfie 2016-03-18        NaT           2
3   Alfie 2021-08-31        NaT           0
4  George 2014-01-23        NaT           0
5  George 2020-02-20        NaT           0
6   Radar 2008-05-24 2010-03-23           0
7   Radar 2010-06-22        NaT          91
8   Radar 2015-09-10 2016-09-12           0
9   Radar 2017-03-11        NaT         180

CodePudding user response:

Try this:

s = df_test['Date1'].sub(df_test.groupby('Name')['Date2'].shift())
s.where(s.notna(),0)
  •  Tags:  
  • Related