Random timestamp generation in R

I have data with information about calls (about 3 million rows).

caller
user_1   
user_2   
user_3   
user_N

I need to create one more column with a random timestamp for each user call, i.e. I want to get something like this:

caller	timestamp
user_1	2019-12-24 21:00:07
user_2	2019-12-27 20:03:19
user_3	2020-01-11 19:30:54
user_N	2020-02-15 22:37:12

Due to restrictions, the time can only be between 18:00:00 and 23:59:59 and dates must be in the range from Jan 1, 2019 to Jan 1, 2021.

Is it possible to implement this in R? Perhaps there are some functions that can be useful here?
I would be grateful for any help!

CodePudding user response：

Given a data frame with IDs:

df <- data.frame(caller = 1:3E6)

You could run

df$timestamp = as.POSIXct("2019-01-01 00:00", tz = "GMT") +
   floor(runif(nrow(df), max = 731))*24*60*60 +   # 731 days: 2019-01-01 through 2020-12-31
   runif(nrow(df), min = 18*60*60, max = 24*60*60)

which adds a uniformly random whole number of days to the start date, plus a uniformly random number of seconds between 18 and 24 hours' worth.

We can verify that the timestamps occur in the desired range:

range(df$timestamp)
range(lubridate::hour(df$timestamp) + lubridate::minute(df$timestamp)/60)
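
For a reproducible run that can be checked without lubridate, the same idea can be sketched with whole seconds. This is only a sketch under a few assumptions: the helper name random_evening_timestamps is made up for illustration, the dates run from 2019-01-01 through 2020-12-31 (731 days), and everything is in GMT.

```r
# Hypothetical helper (the name is illustrative, not from any package):
# draws n timestamps with a random date and a random evening time.
random_evening_timestamps <- function(n,
                                      first_day = "2019-01-01",
                                      n_days = 731L,  # assumption: 2019-01-01 .. 2020-12-31
                                      tz = "GMT") {
  origin <- as.POSIXct(first_day, tz = tz)
  # whole days after first_day: 0 .. n_days - 1
  day_offset <- sample.int(n_days, n, replace = TRUE) - 1L
  # whole seconds after midnight: 64800 (18:00:00) .. 86399 (23:59:59)
  sec_offset <- sample.int(6L * 3600L, n, replace = TRUE) + 18L * 3600L - 1L
  origin + day_offset * 86400 + sec_offset
}

set.seed(42)
df <- data.frame(caller = paste0("user_", 1:10))
df$timestamp <- random_evening_timestamps(nrow(df))

# Base R checks: hour of day and date range
hours <- as.integer(format(df$timestamp, "%H", tz = "GMT"))
stopifnot(all(hours >= 18 & hours <= 23))
range(as.Date(df$timestamp, tz = "GMT"))
```

Drawing with sample.int instead of runif keeps every timestamp on a whole second, which matches the format shown in the question.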

CodePudding user response：

One approach to generating random timestamps in a range is to build a sequence of all possible timestamps in that range with the seq function, and then randomly select n of them with the sample function. For example, to generate 3 random timestamps between Jan 1, 2021 and Jan 3, 2021, at a resolution of one second, you can do:

set.seed(1)
seq(as.POSIXct("2021-01-01 00:00:00") ,as.POSIXct("2021-01-03 23:59:59"), by = "s") |> 
sample(3)

#[1] "2021-01-01 06:46:27 +07" "2021-01-03 04:56:32 +07"
#[3] "2021-01-02 10:33:32 +07"

Note: You can specify your own time zone with the tz argument of as.POSIXct.
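
As a small illustration of that note, the same call with an explicit time zone might look like this. UTC is chosen arbitrarily here; any valid time zone name such as "Asia/Bangkok" should work the same way.

```r
set.seed(1)
# All seconds from Jan 1 to Jan 3, 2021, expressed in UTC
candidates <- seq(as.POSIXct("2021-01-01 00:00:00", tz = "UTC"),
                  as.POSIXct("2021-01-03 23:59:59", tz = "UTC"),
                  by = "s")
picked <- sample(candidates, 3)
attr(picked, "tzone")  # the time zone attribute is carried through sample()
```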

By this approach, you can get 3 million random timestamps by the following steps:

Set the start and the end of the daily range to 18:00:00 and 23:59:59, respectively.

starts <- seq(as.POSIXct("2019-01-01 18:00:00"), as.POSIXct("2021-01-01 18:00:00"), 
       by = "days")
ends <- seq(as.POSIXct("2019-01-01 23:59:59"), as.POSIXct("2021-01-01 23:59:59"), 
       by = "days")

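Calculate the number of samples for each day. Since 3 million is not an exact multiple of the number of days, the remainder is spread over the first few days so that the total is exactly 3 million.

ndays <- length(starts)
n_total <- 3e6
# base count per day, plus one extra for the first (n_total %% ndays) days
n <- rep(n_total %/% ndays, ndays) + (seq_len(ndays) <= n_total %% ndays)

Randomly select n[k] samples from all possible timestamps (one per second) on each day, and store the samples in a list.

sampled_timestamps <- vector("list", ndays)
for (k in seq_len(ndays)) {
  # by = "s" yields every second of the evening window (21,600 candidates per day)
  sampled_timestamps[[k]] <- seq(starts[k], ends[k], by = "s") |>
    sample(n[k])
}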

Convert sampled_timestamps to a single vector so that it can be used as a column in a data frame.

v_sampled_timestamps <- do.call("c", sampled_timestamps)

Now you can use v_sampled_timestamps to fill in the values of the timestamps column in your data frame.

your_df$timestamps <- v_sampled_timestamps
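
One thing to keep in mind: because the list is filled day by day, v_sampled_timestamps comes out ordered by date, so the first rows of the data frame would get the earliest dates. If that matters, one possible fix is to shuffle the vector before assigning it. The sketch below uses tiny made-up stand-ins for the real vectors (3 days, 4 timestamps per day, UTC) so that it runs on its own.

```r
set.seed(1)
# Toy stand-ins (assumptions for illustration): 3 days, 4 timestamps per day
starts <- seq(as.POSIXct("2019-01-01 18:00:00", tz = "UTC"),
              by = "days", length.out = 3)
sampled_timestamps <- lapply(seq_along(starts), function(k) {
  # 4 distinct whole seconds within 18:00:00-23:59:59 of day k
  starts[k] + sample.int(6 * 3600, 4) - 1
})
v_sampled_timestamps <- do.call("c", sampled_timestamps)

your_df <- data.frame(caller = paste0("user_", 1:12))
# sample() without a size argument returns a random permutation
your_df$timestamps <- sample(v_sampled_timestamps)
```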