So I was looking to merge consecutive dates into one merge ID (grouped by SHIP_TO). But, if the consecutive date is longer than 3 days, the merge id will be different too. Here are the following code that I used
dhl_mutate <- dhl %>%
arrange(SHIP_TO) %>%
group_by(MERGE_ID= cumsum(c(TRUE, diff(DELIVERY_DATE) > 1))) %>%
ungroup()
here are the following output that I got
| SHIP_TO | DELIVERY_DATE | MERGE_ID |
|---|---|---|
| 2006 | 2021-04-01 | 1 |
| 2006 | 2021-04-02 | 1 |
| 2006 | 2021-04-03 | 1 |
| 2006 | 2021-04-04 | 1 |
| 2006 | 2021-04-19 | 2 |
| 2006 | 2021-04-23 | 3 |
| 2006 | 2021-04-27 | 4 |
| 2010 | 2021-06-01 | 5 |
| 2010 | 2021-06-02 | 5 |
| 2010 | 2021-07-05 | 6 |
| 2010 | 2021-04-07 | 7 |
My expected output is that for every three consecutive dates, the system will generate a new merge id and the new delivery date will be the same for the last date within the same merge ID
| SHIP_TO | DELIVERY_DATE | MERGE_ID | NEW_DELIVERY_DATE |
|---|---|---|---|
| 2006 | 2021-04-01 | 1 | 2021-04-03 |
| 2006 | 2021-04-02 | 1 | 2021-04-03 |
| 2006 | 2021-04-03 | 1 | 2021-04-03 |
| 2006 | 2021-04-04 | 2 | 2021-04-04 |
| 2006 | 2021-04-19 | 3 | 2021-04-19 |
| 2006 | 2021-04-23 | 4 | 2021-04-23 |
| 2006 | 2021-04-27 | 5 | 2021-04-27 |
| 2010 | 2021-06-01 | 6 | 2021-06-02 |
| 2010 | 2021-06-02 | 6 | 2021-06-02 |
| 2010 | 2021-07-05 | 7 | 2021-07-05 |
| 2010 | 2021-04-07 | 8 | 2021-04-07 |
I would be extremely grateful if someone could help me with this, thank you!
CodePudding user response:
Here is one approach using ceiling_date from lubridate -
library(dplyr)
library(lubridate)
df <- df %>%
mutate(DELIVERY_DATE = as.Date(DELIVERY_DATE),
NEW_DELIVERY_DATE = ceiling_date(DELIVERY_DATE, '3 days'),
MERGE_ID = match(NEW_DELIVERY_DATE, unique(NEW_DELIVERY_DATE))) %>%
group_by(MERGE_ID) %>%
mutate(NEW_DELIVERY_DATE = max(DELIVERY_DATE)) %>%
ungroup
df
# SHIP_TO DELIVERY_DATE MERGE_ID NEW_DELIVERY_DATE
# <int> <date> <int> <date>
# 1 2006 2021-04-01 1 2021-04-03
# 2 2006 2021-04-02 1 2021-04-03
# 3 2006 2021-04-03 1 2021-04-03
# 4 2006 2021-04-04 2 2021-04-04
# 5 2006 2021-04-19 3 2021-04-19
# 6 2006 2021-04-23 4 2021-04-23
# 7 2006 2021-04-27 5 2021-04-27
# 8 2010 2021-06-01 6 2021-06-02
# 9 2010 2021-06-02 6 2021-06-02
#10 2010 2021-07-05 7 2021-07-05
#11 2010 2021-04-07 8 2021-04-07
ceiling_date rounds the date up every 3 days, using match and unique we get a unique MERGE_ID and for each MERGE_ID we select the max DELIVERY_DATE.
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(SHIP_TO = c(2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2010L, 2010L, 2010L, 2010L), DELIVERY_DATE = c("2021-04-01",
"2021-04-02", "2021-04-03", "2021-04-04", "2021-04-19", "2021-04-23",
"2021-04-27", "2021-06-01", "2021-06-02", "2021-07-05", "2021-04-07"
), MERGE_ID = c(1L, 1L, 1L, 1L, 2L, 3L, 4L, 5L, 5L, 6L, 7L)),
row.names = c(NA, -11L), class = "data.frame")
