I have a dataframe that is a time series and I want to get the cumsum of differences between the CLOSE and SUBMISSION of an issue. However, I want it only to subtract if the CLOSE value is higher than the SUBMISSION value. Here are the data points (sorted by CLOSE), the expected output, and my attempted code:
df = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5], 'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27']})
df['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')
For REF_KEY == 1, ACCUM_DATE_DELTA should be the sum of:
- 15 day difference of ('2018-09-05' - '2018-08-21')
- 2 day difference between ('2018-09-05' - '09-03-2018')
- 8 day difference between ('2018-09-05' - '2018-08-28') making it 26
For REF_KEY == 2, you will get the sum of:
- 22 day difference between ('2018-09-12' - '2018-08-21')
- 9 day difference between ('2018-09-12' - '2018-09-03')
- 5 day difference between ('2018-09-12' - '2018-09-07')
- 6 day difference between ('2018-09-12' - '2018-09-06')
- 15 day difference between ('2018-09-12' - '2018-08-28')
So for REF_KEY == 1, you can see that the difference between its close date includes REF_KEY == [3, 4], and that is because the CLOSE is greater than SUBMISSION. Therefore, I had the idea of creating a condition where the CLOSE date has to be more than SUBMISSION date.
df_2 = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5],
'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'], 'ACCUM_DATE_DELTA': [25, 57, 86, 116, 131]})
df_2['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df_2['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')
Attempted code:
df_2['ACCUM_DATE_DELTA'] = df_2['CLOSE']*len(df_2[df_2['CLOSE'] - df_2['SUBMISSION]]['SUBMISSION'].cumsum()) - df_2[df_2['CLOSE'] - df_2['SUBMISSION]]['SUBMISSION'].cumsum()
CodePudding user response:
I'm not whose calculations are correct. If mine are, let me know if this is what you want.
For 'REF_KEY' == 1, you can use this to find the cumulative sum of days
(df['CLOSE'][0] - df['SUBMISSION']).dt.days.clip(lower=0).sum()
df['CLOSE'][0] is first close date to calculate against all submission dates; dt.days gives days in integer
(df['CLOSE'][0] - df['SUBMISSION']).dt.days
0 15
1 2
2 -2
3 -1
4 8
Name: SUBMISSION, dtype: int64
Use clip(lower=0).sum() to change negative values to zero and sum
(df['CLOSE'][0] - df['SUBMISSION']).dt.days.clip(lower=0).sum()
result = 25
To automate this, use apply() with a custom function
def calc(x):
# print((x - df['SUBMISSION']).dt.days.clip(lower=0).sum())
return (x - df['SUBMISSION']).dt.days.clip(lower=0).sum()
df
REF_KEY SUBMISSION CLOSE
0 1 2018-08-21 2018-09-05
1 2 2018-09-03 2018-09-12
2 3 2018-09-07 2018-09-18
3 4 2018-09-06 2018-09-24
4 5 2018-08-28 2018-09-27
df['ACCUM_DATE_DELTA'] = df.apply(lambda x: calc(x['CLOSE']), axis=1)
REF_KEY SUBMISSION CLOSE ACCUM_DATE_DELTA
0 1 2018-08-21 2018-09-05 25
1 2 2018-09-03 2018-09-12 57
2 3 2018-09-07 2018-09-18 87
3 4 2018-09-06 2018-09-24 117
4 5 2018-08-28 2018-09-27 132
CodePudding user response:
- Cross-
mergeto generate their cartesian product ofSUBMISSIONxCLOSE - Keep only the rows
whereCLOSE > SUBMISSION groupbytheCLOSEdates and sum the group'sCLOSE - SUBMISSIONdaysmergetheACCUMvalues back to the original df
m = pd.merge(df.SUBMISSION, df.CLOSE, how='cross') # cross-merge for all SUBMISSION x CLOSE combos
accum = (m.where(m.CLOSE > m.SUBMISSION) # limit to CLOSE > SUBMISSION
.groupby('CLOSE').SUBMISSION # group by CLOSE
.apply(lambda g: (g.name - g).sum()) # sum of all (CLOSE - SUBMISSION)
.rename('ACCUM'))
df.merge(accum, on='CLOSE') # merge back to df
Output:
REF_KEY SUBMISSION CLOSE ACCUM
0 1 2018-08-21 2018-09-05 25 days
1 2 2018-09-03 2018-09-12 57 days
2 3 2018-09-07 2018-09-18 87 days
3 4 2018-09-06 2018-09-24 117 days
4 5 2018-08-28 2018-09-27 132 days
Notes:
how='cross'requires pandas 1.2.0 , so for earlier versions,mergeon a dummykeycolumn:m = df[['SUBMISSION']].assign(key=0).merge(df[['CLOSE']].assign(key=0), on='key').drop(columns='key')As with Jonathan's solution, a couple of these days are off by 1 compared to your output.
