Merge consecutive delivery date (with certain condition) into one merge ID in R-CodePudding

So I was looking to merge consecutive dates into one merge ID (grouped by SHIP_TO). But, if the consecutive date is longer than 3 days, the merge id will be different too. Here are the following code that I used

dhl_mutate <- dhl %>%
  arrange(SHIP_TO) %>%
  group_by(MERGE_ID= cumsum(c(TRUE, diff(DELIVERY_DATE) > 1))) %>%
  ungroup()

here are the following output that I got

SHIP_TO	DELIVERY_DATE	MERGE_ID
2006	2021-04-01	1
2006	2021-04-02	1
2006	2021-04-03	1
2006	2021-04-04	1
2006	2021-04-19	2
2006	2021-04-23	3
2006	2021-04-27	4
2010	2021-06-01	5
2010	2021-06-02	5
2010	2021-07-05	6
2010	2021-04-07	7

My expected output is that for every three consecutive dates, the system will generate a new merge id and the new delivery date will be the same for the last date within the same merge ID

SHIP_TO	DELIVERY_DATE	MERGE_ID	NEW_DELIVERY_DATE
2006	2021-04-01	1	2021-04-03
2006	2021-04-02	1	2021-04-03
2006	2021-04-03	1	2021-04-03
2006	2021-04-04	2	2021-04-04
2006	2021-04-19	3	2021-04-19
2006	2021-04-23	4	2021-04-23
2006	2021-04-27	5	2021-04-27
2010	2021-06-01	6	2021-06-02
2010	2021-06-02	6	2021-06-02
2010	2021-07-05	7	2021-07-05
2010	2021-04-07	8	2021-04-07

I would be extremely grateful if someone could help me with this, thank you!

CodePudding user response：

Here is one approach using ceiling_date from lubridate -

library(dplyr)
library(lubridate)

df <- df %>%
  mutate(DELIVERY_DATE = as.Date(DELIVERY_DATE), 
         NEW_DELIVERY_DATE = ceiling_date(DELIVERY_DATE, '3 days'), 
         MERGE_ID = match(NEW_DELIVERY_DATE, unique(NEW_DELIVERY_DATE))) %>%
  group_by(MERGE_ID) %>%
  mutate(NEW_DELIVERY_DATE = max(DELIVERY_DATE)) %>%
  ungroup

df

#   SHIP_TO DELIVERY_DATE MERGE_ID NEW_DELIVERY_DATE
#     <int> <date>           <int> <date>           
# 1    2006 2021-04-01           1 2021-04-03       
# 2    2006 2021-04-02           1 2021-04-03       
# 3    2006 2021-04-03           1 2021-04-03       
# 4    2006 2021-04-04           2 2021-04-04       
# 5    2006 2021-04-19           3 2021-04-19       
# 6    2006 2021-04-23           4 2021-04-23       
# 7    2006 2021-04-27           5 2021-04-27       
# 8    2010 2021-06-01           6 2021-06-02       
# 9    2010 2021-06-02           6 2021-06-02       
#10    2010 2021-07-05           7 2021-07-05       
#11    2010 2021-04-07           8 2021-04-07

ceiling_date rounds the date up every 3 days, using match and unique we get a unique MERGE_ID and for each MERGE_ID we select the max DELIVERY_DATE.

data

It is easier to help if you provide data in a reproducible format

df <- structure(list(SHIP_TO = c(2006L, 2006L, 2006L, 2006L, 2006L, 
2006L, 2006L, 2010L, 2010L, 2010L, 2010L), DELIVERY_DATE = c("2021-04-01", 
"2021-04-02", "2021-04-03", "2021-04-04", "2021-04-19", "2021-04-23", 
"2021-04-27", "2021-06-01", "2021-06-02", "2021-07-05", "2021-04-07"
), MERGE_ID = c(1L, 1L, 1L, 1L, 2L, 3L, 4L, 5L, 5L, 6L, 7L)), 
row.names = c(NA, -11L), class = "data.frame")