I have a problem. I want to get the difference of the last date. For example 2021-03-22 until the next date (2021-03-18) it is 4 days. I want to calculate the difference in days between the row date and the last date for the customerId. So the complete calculation should be for each customer. And the last date should be None because I dont have any older date.
Dataframe
customerId fromDate otherInformation
0 1 2021-02-22 Cat
1 1 2021-03-18 Dog
2 1 2021-03-22 Cat
3 1 2021-02-10 Cat
4 1 2021-09-07 Cat
5 1 None Elefant
6 1 2022-01-18 Fish
7 2 2021-05-17 Fish
Code
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
'otherInformation': ['Cat', 'Dog', 'Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
}
df = pd.DataFrame(data=d)
print(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
df = df.drop_duplicates(subset=['customerId', 'fromDate'], keep='first')
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#print(True in df.index.duplicated())
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
print(df)
What I have
customerId fromDate lastInteractivity
0 1 2021-02-10 477 days
1 1 2021-02-11 NaT
2 1 2021-02-12 NaT
3 1 2021-02-13 NaT
4 1 2021-02-14 NaT
.. ... ... ...
339 1 2022-01-15 NaT
340 1 2022-01-16 NaT
341 1 2022-01-17 NaT
342 1 2022-01-18 135 days
343 2 2021-05-17 381 days
[344 rows x 3 columns]
What I want
customerId fromDate otherInformation lastInDays
0 1 2021-02-22 Cat 12 #last date 2021-02-10
1 1 2021-03-18 Dog 36 #last date 2021-02-22
2 1 2021-03-22 Cat 4 #last date 2021-03-18
3 1 2021-02-10 Cat None #last date not found
4 1 2021-09-07 Cat 169 #last date 2021-03-22
5 1 None Elefant None #was None
6 1 2022-01-18 Fish 133 #last date 2021-09-07
7 2 2021-05-17 Fish None #last date not found
CodePudding user response:
Sort the dataframe by date column then group by customerId and shift the date column then subtract this from the original date column to get the difference in days
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-03-18 Dog 24 days
2 1 2021-03-22 Cat 4 days
3 1 2021-02-10 Cat NaT
4 1 2021-09-07 Cat 169 days
5 1 NaT Elefant NaT
6 1 2022-01-18 Fish 133 days
7 2 2021-05-17 Fish NaT
