I have data with information about calls (about 3 million rows).
| caller |
|---|
| user_1 |
| user_2 |
| user_3 |
| user_N |
I need to create one more column with a random timestamp for each user call, i.e. I want to get something like this:
| caller | timestamp |
|---|---|
| user_1 | 2019-12-24 21:00:07 |
| user_2 | 2019-12-27 20:03:19 |
| user_3 | 2020-01-11 19:30:54 |
| user_N | 2020-02-15 22:37:12 |
Due to restrictions, the time can only be between 18:00:00 and 23:59:59 and dates must be in the range from Jan 1, 2019 to Jan 1, 2021.
Is it possible to implement this in R? Perhaps there are some functions that can be useful here?
I would be grateful for any help!
CodePudding user response:
Given a data frame with IDs:
df <- data.frame(caller = 1:3E6)
You could run
# 731 days = 365 (2019) + 366 (2020), i.e. 2019-01-01 through 2020-12-31
df$timestamp = as.POSIXct("2019-01-01 00:00", tz = "GMT") +
  floor(runif(nrow(df), max = 731))*24*60*60 +
  runif(nrow(df), min = 18*60*60, max = 24*60*60)
which would add a uniform random number of days, and a random number of seconds between 18 and 24 hours' worth.
We can verify that the timestamps occur in the desired range:
range(df$timestamp)
range(lubridate::hour(df$timestamp) + lubridate::minute(df$timestamp)/60)
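Because runif() returns fractional values, the timestamps above carry fractional seconds. If whole seconds are preferred, the same idea can be sketched with integer sampling. This is only a sketch: the function name random_evening_timestamps and its arguments are made up for illustration, it assumes the window is meant in UTC, and it treats Dec 31, 2020 as the last allowed day.

```r
# Hypothetical helper (not from the answer above): draws n timestamps with
# whole seconds, each on a random day in [first_day, last_day] and at a
# random second between 18:00:00 and 23:59:59. All arithmetic is done in UTC.
random_evening_timestamps <- function(n,
                                      first_day = as.Date("2019-01-01"),
                                      last_day  = as.Date("2020-12-31")) {
  n_days <- as.integer(last_day - first_day) + 1L
  # Day offset 0 .. n_days - 1, sampled with replacement
  day_offset <- sample.int(n_days, n, replace = TRUE) - 1L
  # Second of the day: 64800 (18:00:00) .. 86399 (23:59:59)
  sec_of_day <- sample(seq.int(18L * 3600L, 24L * 3600L - 1L), n, replace = TRUE)
  epoch_secs <- (as.numeric(first_day) + day_offset) * 86400 + sec_of_day
  as.POSIXct(epoch_secs, origin = "1970-01-01", tz = "UTC")
}

set.seed(1)
ts <- random_evening_timestamps(5)
```

For the data frame above this would be used as df$timestamp <- random_evening_timestamps(nrow(df)).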
CodePudding user response:
One way to generate random timestamps in a range is to build the sequence of all possible timestamps in that range with seq() and then randomly select n of them with sample(). For example, to generate 3 random timestamps between Jan 1, 2021 and Jan 3, 2021, at one-second resolution, you can do:
set.seed(1)
seq(as.POSIXct("2021-01-01 00:00:00") ,as.POSIXct("2021-01-03 23:59:59"), by = "s") |>
sample(3)
#[1] "2021-01-01 06:46:27 +07" "2021-01-03 04:56:32 +07"
#[3] "2021-01-02 10:33:32 +07"
Note: You can specify your own time zone with the tz argument of as.POSIXct().
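As a small illustration of that note, the same example can be pinned to a fixed time zone. The choice of "UTC" here is only an assumption for the sake of the example; any time zone name your system knows should work the same way.

```r
set.seed(1)
# Same three-day example, but with an explicit time zone instead of the
# session's local one ("UTC" is an arbitrary choice for illustration).
utc_sample <- seq(as.POSIXct("2021-01-01 00:00:00", tz = "UTC"),
                  as.POSIXct("2021-01-03 23:59:59", tz = "UTC"),
                  by = "s") |>
  sample(3)
utc_sample
```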
With this approach, you can get 3 million random timestamps by the following steps:
- Set the start and the end of the daily range to 18:00:00 and 23:59:59, respectively.
starts <- seq(as.POSIXct("2019-01-01 18:00:00"), as.POSIXct("2021-01-01 18:00:00"),
by = "days")
ends <- seq(as.POSIXct("2019-01-01 23:59:59"), as.POSIXct("2021-01-01 23:59:59"),
by = "days")
- Calculate the number of samples for each day, rounding up so that at least 3 million timestamps are drawn in total.
ndays <- length(starts)
n <- ceiling(3e6 / ndays)
- Randomly select n samples from all possible timestamps (at one-second resolution) on each day, and store the samples in a list.
sampled_timestamps <- vector("list", ndays)
for (k in seq_len(ndays)) {
  sampled_timestamps[[k]] <- seq(starts[k], ends[k], by = "s") |>
    sample(n)
}
- Convert sampled_timestamps to a vector so that it can be used as a column in a data frame.
v_sampled_timestamps <- do.call("c", sampled_timestamps)
Now you can use v_sampled_timestamps to fill in the timestamps column of your data frame. Because n was rounded up, the vector is slightly longer than 3 million and is ordered by day, so draw nrow(your_df) values from it at random:
your_df$timestamps <- sample(v_sampled_timestamps, nrow(your_df))
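Whichever way the column is filled, it may be worth checking that every value respects the restrictions from the question. The following is a minimal base-R sketch. The helper name in_allowed_window and its arguments are hypothetical, and tz should be the time zone the timestamps were generated in ("UTC" is assumed here).

```r
# Hypothetical helper: TRUE only if every timestamp lies between 18:00:00 and
# 23:59:59 on a day within [first_day, last_day], evaluated in time zone tz.
in_allowed_window <- function(x,
                              first_day = as.Date("2019-01-01"),
                              last_day  = as.Date("2021-01-01"),
                              tz = "UTC") {
  hours <- as.integer(format(x, "%H", tz = tz))
  days  <- as.Date(format(x, "%Y-%m-%d", tz = tz))
  all(hours >= 18 & hours <= 23) && all(days >= first_day & days <= last_day)
}

good <- as.POSIXct(c("2019-01-01 18:00:00", "2020-12-31 23:59:59"), tz = "UTC")
bad  <- as.POSIXct("2020-06-15 12:00:00", tz = "UTC")

in_allowed_window(good)          # TRUE
in_allowed_window(c(good, bad))  # FALSE: 12:00 is outside the evening window
```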
