How to randomly assign a user id to a different date using R-CodePudding

I have such dataset

mydata=structure(list(SAP = c("S985", "S985", "S985", "S985", "S985", 
"S985", "S985", "S985"), code = c(50000690L, 50000690L, 50000690L, 
50000690L, 801L, 50000690L, 50000690L, 50000690L), dt_start = c("18.01.2022 8:00", 
"18.01.2022 12:00", "18.01.2022 12:00", "18.01.2022 10:00", "22.01.2022 10:00", 
"22.01.2022 10:00", "22.01.2022 12:00", "22.01.2022 12:00"), 
    dt_end = c("18.01.2022 10:00", "18.01.2022 22:00", "18.01.2022 22:00", 
    "18.01.2022 20:00", "22.01.2022 20:00", "22.01.2022 20:00", 
    "22.01.2022 22:00", "22.01.2022 22:00"), person_number = c(938541L, 
    938541L, 938541L, 938541L, 754105L, 867801L, 865394L, 938541L
    )), class = "data.frame", row.names = c(NA, -8L))

In this dataset, we see that there is column person_number=938541 who have only one date 18.01 in the columns dt start and dt end. This means that on this date he works alone (do not pay attention to duplicates in the date, this is not important now). But in general, we see that whe have 4 unique person_number (persons).

How to do that if it so happened that on some date one person works, then we are looking for a all unique personal_number, and we sign randomly anyone on the date where work just one personal_number. For example desired result like this

SAP code           dt_start                    dt_end         person_number
S985    50000690    18.01.2022 8:00     18.01.2022 10:00    938541
S985    50000690    18.01.2022 12:00    18.01.2022 22:00    938541
S985    50000690    18.01.2022 12:00    18.01.2022 22:00    938541
S985    50000690    18.01.2022 10:00    18.01.2022 20:00    938541
**S985  50000690    18.01.2022 12:00    18.01.2022 22:00    865394**
S985    801         22.01.2022 10:00    22.01.2022 20:00    754105
S985    50000690    22.01.2022 10:00    22.01.2022 20:00    867801
S985    50000690    22.01.2022 12:00    22.01.2022 22:00    865394
S985    50000690    22.01.2022 12:00    22.01.2022 22:00    938541

randomly i chose 865394 and pasted it on 18.01 date.So on 18.01 date two people work instead of one. Your help will be appreciated. Thank you.

CodePudding user response：

Both options choose the first dt_start/dt_end to which the randomly-assigned person_number is assigned to augment a single-worker.

Side note: your timestamps should likely be "real" POSIXt-class, for two reasons: (1) sort-order will be wrong now, since you have day-month-year; (2) if you want to calculate differences or such, you need to be number-like. Neither of my solutions is doing this, a quick as.POSIXct(., format="%d.%m.%Y %H:%M") on each will work. Both of the solutions work as-is if the columns are character or POSIXt.

dplyr

library(dplyr)
persons <- unique(mydata$person_number)
mydata %>%
  group_by(date = substring(dt_start, 1, 10)) %>%
  summarize(
    n = n_distinct(person_number),
    across(-person_number, ~ .[1]),
    person_number = sample(setdiff(persons, person_number))[1]
  ) %>%
  filter(n == 1L) %>%
  select(-n, -date) %>%
  bind_rows(mydata)
# # A tibble: 9 x 5
#   SAP       code dt_start         dt_end           person_number
#   <chr>    <int> <chr>            <chr>                    <int>
# 1 S985  50000690 18.01.2022 8:00  18.01.2022 10:00        754105
# 2 S985  50000690 18.01.2022 8:00  18.01.2022 10:00        938541
# 3 S985  50000690 18.01.2022 12:00 18.01.2022 22:00        938541
# 4 S985  50000690 18.01.2022 12:00 18.01.2022 22:00        938541
# 5 S985  50000690 18.01.2022 10:00 18.01.2022 20:00        938541
# 6 S985       801 22.01.2022 10:00 22.01.2022 20:00        754105
# 7 S985  50000690 22.01.2022 10:00 22.01.2022 20:00        867801
# 8 S985  50000690 22.01.2022 12:00 22.01.2022 22:00        865394
# 9 S985  50000690 22.01.2022 12:00 22.01.2022 22:00        938541

data.table

library(data.table)
myDT <- as.data.table(mydata) # just for local testing, should use setDT(mydata)
persons <- unique(mydata$person_number)
rbindlist(list(
  myDT,
  myDT[, date := substring(dt_start, 1, 10)
    ][, cbind(.SD[1,], n = uniqueN(person_number), person_number = sample(setdiff(persons, person_number))[1]),
        by = date, .SDcols = -"person_number"
    ][ n == 1L,][, n := NULL]
  ), use.names = TRUE)
#       SAP     code         dt_start           dt_end person_number       date
#    <char>    <int>           <char>           <char>         <int>     <char>
# 1:   S985 50000690  18.01.2022 8:00 18.01.2022 10:00        938541 18.01.2022
# 2:   S985 50000690 18.01.2022 12:00 18.01.2022 22:00        938541 18.01.2022
# 3:   S985 50000690 18.01.2022 12:00 18.01.2022 22:00        938541 18.01.2022
# 4:   S985 50000690 18.01.2022 10:00 18.01.2022 20:00        938541 18.01.2022
# 5:   S985      801 22.01.2022 10:00 22.01.2022 20:00        754105 22.01.2022
# 6:   S985 50000690 22.01.2022 10:00 22.01.2022 20:00        867801 22.01.2022
# 7:   S985 50000690 22.01.2022 12:00 22.01.2022 22:00        865394 22.01.2022
# 8:   S985 50000690 22.01.2022 12:00 22.01.2022 22:00        938541 22.01.2022
# 9:   S985 50000690  18.01.2022 8:00 18.01.2022 10:00        754105 18.01.2022

This could also benefit from using %>% for piping to better show the flow of data (i.e., readability), I chose against it to not blur the packages/dialects.

This alternative (more concise) is derived from Henrik's awesome suggestion:

myDT[, date := substring(dt_start, 1, 10)
  ][ , if (uniqueN(person_number) == 1) {
         .SD[c(1, 1:.N)][1, person_number := sample(setdiff(persons, person_number), 1)]
       } else .SD, by = date]