I have such dataset
mydata=structure(list(SAP = c("S985", "S985", "S985", "S985", "S985",
"S985", "S985", "S985"), code = c(50000690L, 50000690L, 50000690L,
50000690L, 801L, 50000690L, 50000690L, 50000690L), dt_start = c("18.01.2022 8:00",
"18.01.2022 12:00", "18.01.2022 12:00", "18.01.2022 10:00", "22.01.2022 10:00",
"22.01.2022 10:00", "22.01.2022 12:00", "22.01.2022 12:00"),
dt_end = c("18.01.2022 10:00", "18.01.2022 22:00", "18.01.2022 22:00",
"18.01.2022 20:00", "22.01.2022 20:00", "22.01.2022 20:00",
"22.01.2022 22:00", "22.01.2022 22:00"), person_number = c(938541L,
938541L, 938541L, 938541L, 754105L, 867801L, 865394L, 938541L
)), class = "data.frame", row.names = c(NA, -8L))
In this dataset, we see that there is column person_number=938541 who have only one date 18.01 in the columns dt start and dt end. This means that on this date he works alone (do not pay attention to duplicates in the date, this is not important now). But in general, we see that whe have 4 unique person_number (persons).
754105
867801
865394
938541
How to do that if it so happened that on some date one person works, then we are looking for a all unique personal_number, and we sign randomly anyone on the date where work just one personal_number. For example desired result like this
SAP code dt_start dt_end person_number
S985 50000690 18.01.2022 8:00 18.01.2022 10:00 938541
S985 50000690 18.01.2022 12:00 18.01.2022 22:00 938541
S985 50000690 18.01.2022 12:00 18.01.2022 22:00 938541
S985 50000690 18.01.2022 10:00 18.01.2022 20:00 938541
**S985 50000690 18.01.2022 12:00 18.01.2022 22:00 865394**
S985 801 22.01.2022 10:00 22.01.2022 20:00 754105
S985 50000690 22.01.2022 10:00 22.01.2022 20:00 867801
S985 50000690 22.01.2022 12:00 22.01.2022 22:00 865394
S985 50000690 22.01.2022 12:00 22.01.2022 22:00 938541
randomly i chose 865394 and pasted it on 18.01 date.So on 18.01 date two people work instead of one.
Your help will be appreciated. Thank you.
CodePudding user response:
Both options choose the first dt_start/dt_end to which the randomly-assigned person_number is assigned to augment a single-worker.
Side note: your timestamps should likely be "real" POSIXt-class, for two reasons: (1) sort-order will be wrong now, since you have day-month-year; (2) if you want to calculate differences or such, you need to be number-like. Neither of my solutions is doing this, a quick as.POSIXct(., format="%d.%m.%Y %H:%M") on each will work. Both of the solutions work as-is if the columns are character or POSIXt.
dplyr
library(dplyr)
persons <- unique(mydata$person_number)
mydata %>%
group_by(date = substring(dt_start, 1, 10)) %>%
summarize(
n = n_distinct(person_number),
across(-person_number, ~ .[1]),
person_number = sample(setdiff(persons, person_number))[1]
) %>%
filter(n == 1L) %>%
select(-n, -date) %>%
bind_rows(mydata)
# # A tibble: 9 x 5
# SAP code dt_start dt_end person_number
# <chr> <int> <chr> <chr> <int>
# 1 S985 50000690 18.01.2022 8:00 18.01.2022 10:00 754105
# 2 S985 50000690 18.01.2022 8:00 18.01.2022 10:00 938541
# 3 S985 50000690 18.01.2022 12:00 18.01.2022 22:00 938541
# 4 S985 50000690 18.01.2022 12:00 18.01.2022 22:00 938541
# 5 S985 50000690 18.01.2022 10:00 18.01.2022 20:00 938541
# 6 S985 801 22.01.2022 10:00 22.01.2022 20:00 754105
# 7 S985 50000690 22.01.2022 10:00 22.01.2022 20:00 867801
# 8 S985 50000690 22.01.2022 12:00 22.01.2022 22:00 865394
# 9 S985 50000690 22.01.2022 12:00 22.01.2022 22:00 938541
data.table
library(data.table)
myDT <- as.data.table(mydata) # just for local testing, should use setDT(mydata)
persons <- unique(mydata$person_number)
rbindlist(list(
myDT,
myDT[, date := substring(dt_start, 1, 10)
][, cbind(.SD[1,], n = uniqueN(person_number), person_number = sample(setdiff(persons, person_number))[1]),
by = date, .SDcols = -"person_number"
][ n == 1L,][, n := NULL]
), use.names = TRUE)
# SAP code dt_start dt_end person_number date
# <char> <int> <char> <char> <int> <char>
# 1: S985 50000690 18.01.2022 8:00 18.01.2022 10:00 938541 18.01.2022
# 2: S985 50000690 18.01.2022 12:00 18.01.2022 22:00 938541 18.01.2022
# 3: S985 50000690 18.01.2022 12:00 18.01.2022 22:00 938541 18.01.2022
# 4: S985 50000690 18.01.2022 10:00 18.01.2022 20:00 938541 18.01.2022
# 5: S985 801 22.01.2022 10:00 22.01.2022 20:00 754105 22.01.2022
# 6: S985 50000690 22.01.2022 10:00 22.01.2022 20:00 867801 22.01.2022
# 7: S985 50000690 22.01.2022 12:00 22.01.2022 22:00 865394 22.01.2022
# 8: S985 50000690 22.01.2022 12:00 22.01.2022 22:00 938541 22.01.2022
# 9: S985 50000690 18.01.2022 8:00 18.01.2022 10:00 754105 18.01.2022
This could also benefit from using %>% for piping to better show the flow of data (i.e., readability), I chose against it to not blur the packages/dialects.
This alternative (more concise) is derived from Henrik's awesome suggestion:
myDT[, date := substring(dt_start, 1, 10)
][ , if (uniqueN(person_number) == 1) {
.SD[c(1, 1:.N)][1, person_number := sample(setdiff(persons, person_number), 1)]
} else .SD, by = date]
