I have a dataframe that has several columns. I grouped it by 'Name' and sorted by 'Date1'. A subset of the dataset is below :-
data_ = {'Name': ['Alfie', 'Alfie', 'Alfie', 'Alfie','George','George','Radar','Radar','Radar','Radar'],'Date1': ['2009-03-26', '2014-12-23', '2016-03-18', '2021-08-31','2014-01-23','2020-02-20','2008-05-24','2010-06-22','2015-09-10','2017-03-11'], 'Date2': [np.NaN, '2016-03-16', np.NaN, np.NaN,np.NaN,np.NaN,'2010-03-23',np.NaN,'2016-09-12',np.NaN]
, 'Difference': [0, 0, '2 days', 0,0,0,0,'91 days',0,'181 days']}
# Create DataFrame.
df_test = pd.DataFrame(data_)
df_test["Date1"] = pd.to_datetime(df_test["Date1"],format="%Y-%m-%d")
df_test["Date2"] = pd.to_datetime(df_test["Date2"],format="%Y-%m-%d")
I would like to compute the values in the 'Difference' column. It's the difference between the next 'Date1' value and a non-null 'Date2' as grouped by 'Name'. If 'Date2' is null, the Difference can just be set to 0. How can I make the computation in Pandas? Should it be like checking every row in the group by function, then comparing dates in the two columns?
CodePudding user response:
Here is my attempt, it is a bit ugly, but I think it works (except that the last difference is 180 days instead of 181):
import pandas as pd
import numpy as np
data_ = {'Name': ['Alfie', 'Alfie', 'Alfie', 'Alfie','George','George','Radar','Radar','Radar','Radar'],'Date1': ['2009-03-26', '2014-12-23', '2016-03-18', '2021-08-31','2014-01-23','2020-02-20','2008-05-24','2010-06-22','2015-09-10','2017-03-11'], 'Date2': [np.NaN, '2016-03-16', np.NaN, np.NaN,np.NaN,np.NaN,'2010-03-23',np.NaN,'2016-09-12',np.NaN]
}
# Create DataFrame.
df_test = pd.DataFrame(data_)
df_test["Date1"] = pd.to_datetime(df_test["Date1"],format="%Y-%m-%d")
df_test["Date2"] = pd.to_datetime(df_test["Date2"],format="%Y-%m-%d")
difference_list = [0]
rows = list(df_test.iterrows())
for i in range(len(rows)-1):
if not pd.isnull(rows[i][1]["Date2"]) and not pd.isnull(rows[i 1][1]["Date1"]):
difference_list.append((rows[i 1][1]["Date1"] - rows[i][1]["Date2"]).days)
else:
difference_list.append(0)
df_test["Difference"] = difference_list
print(df_test)
Essentially what I am doing is just putting all the rows into a list, and then doing the subtraction as you specified, where the (i 1)th Date1 is subtracted from the ith Date 2 to give the difference. The if and else are just to account for when there are null datetimes.
Output:
Name Date1 Date2 Difference
0 Alfie 2009-03-26 NaT 0
1 Alfie 2014-12-23 2016-03-16 0
2 Alfie 2016-03-18 NaT 2
3 Alfie 2021-08-31 NaT 0
4 George 2014-01-23 NaT 0
5 George 2020-02-20 NaT 0
6 Radar 2008-05-24 2010-03-23 0
7 Radar 2010-06-22 NaT 91
8 Radar 2015-09-10 2016-09-12 0
9 Radar 2017-03-11 NaT 180
CodePudding user response:
Try this:
s = df_test['Date1'].sub(df_test.groupby('Name')['Date2'].shift())
s.where(s.notna(),0)
