I'm new to R and I'm facing a problem, I have a date vector and a dataframe containing data regarding sales values and coverage start and end dates.
I need to defer the sale value at each analysis date, for the first analysis period, I can create an algorithm that gives me the desired answer. However in my real data I am working with a base of 200K rows and 50 analysis periods.
I'm not able to build a loop or find an alternative function in R that allows me to create the variables Aux[i] and Test[i] according to the number of dates present in the vec_date vector.
The following is an example of code that works for the first analysis period.
library(tidyverse)
library(lubridate)
df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500))
date <- ymd("2021-12-31")
vec_date <- date %m % months(seq(0, 12, by = 6))
df_new <- df |>
mutate(duration = as.numeric(DateFin - DateIn),
Pr_day = Premium/duration,
Aux1 = if_else(DateIn > vec_date[1] | DateFin < vec_date[1], "N", "Y"),
test1 = if_else(Aux1 == "Y" & DateFin > vec_date[1], as.numeric(DateFin - vec_date[1])*Pr_day,
if_else(DateIn > vec_date[1], Premium, 0)))
Does anyone have any idea how I could build this loop, or is there any R function/package that allows me to perform this interaction between my df dataframe and vec_date vector?
Edit: an outline of the format you would need as a result would be:
df_final <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500),
Aux1 = c("Y", "Y", "N"),
test1 = c(5421.429, 4849.246, 24500.000),
Aux2 = c("N", "Y", "Y"),
test2 = c(0.0000, 301.5075, 20125.0000),
Aux3 = c("N", "N", "Y"),
test3 = c(0, 0, 4025))
Where, Aux1 and test1 are the results referring to vec_date[1], 2 = vec_date[2], 3 = vec_date[3]. For me it is important to keep the resulting variables in the same dataframe because later analysis will be done.
CodePudding user response:
I'm having trouble figuring out what your intended outcome is supposed to be; is this example heading in the right direction?
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500))
date <- ymd("2021-12-31")
vec_date <- date %m % months(seq(0, 12, by = 6))
test_func <- function(ind){
df_new <- df |>
mutate(duration = as.numeric(DateFin - DateIn),
Pr_day = Premium/duration,
Aux1 = if_else(DateIn > vec_date[ind] | DateFin < vec_date[ind], "N", "Y"),
test1 = if_else(Aux1 == "Y" & DateFin > vec_date[ind], as.numeric(DateFin - vec_date[ind])*Pr_day,
if_else(DateIn > vec_date[ind], Premium, 0))) |>
filter(Aux1 == "Y")
}
list_of_results <- lapply(seq_along(vec_date), test_func)
names(list_of_results) <- vec_date
list_of_results
#> $`2021-12-31`
#> # A tibble: 2 × 7
#> DateIn DateFin Premium duration Pr_day Aux1 test1
#> <date> <date> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 2021-10-21 2022-03-10 11000 140 78.6 Y 5421.
#> 2 2021-12-25 2022-07-12 5000 199 25.1 Y 4849.
#>
#> $`2022-06-30`
#> # A tibble: 2 × 7
#> DateIn DateFin Premium duration Pr_day Aux1 test1
#> <date> <date> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 2021-12-25 2022-07-12 5000 199 25.1 Y 302.
#> 2 2022-05-11 2023-02-15 24500 280 87.5 Y 20125
#>
#> $`2022-12-31`
#> # A tibble: 1 × 7
#> DateIn DateFin Premium duration Pr_day Aux1 test1
#> <date> <date> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 2022-05-11 2023-02-15 24500 280 87.5 Y 4025
Created on 2022-01-27 by the reprex package (v2.0.1)
CodePudding user response:
As @Jon Spring suggests in the comments, probably the preferred approach here
would be to use tidyr::complete() to extend your data frame, repeating each
row in it for each of your analysis dates. Then, you can stick to vectorized
calculations and get the analysis date column in the resulting data, too.
Below is how to do just that with the example data you provided. I took the liberty of renaming some columns, and simplifying the control-flow based calculation according to my understanding of the problem, based on what you shared.
First, the example data slightly reframed:
library(tidyverse)
library(lubridate)
policies <- tibble(
policy_id = seq_len(3),
start = ymd("2021-10-21", "2021-12-25", "2022-05-11"),
end = ymd("2022-03-10", "2022-07-12", "2023-02-15"),
premium = c(11000, 5000, 24500)
)
policies
#> # A tibble: 3 x 4
#> policy_id start end premium
#> <int> <date> <date> <dbl>
#> 1 1 2021-10-21 2022-03-10 11000
#> 2 2 2021-12-25 2022-07-12 5000
#> 3 3 2022-05-11 2023-02-15 24500
Then, finding remaining prorated premiums for policies at given dates:
start_date <- ymd("2021-12-31")
dates <- start_date %m % months(seq(0, 12, by = 6))
policies %>%
mutate(
days = as.numeric(end - start),
daily_premium = premium / days
) %>%
crossing(date = dates) %>%
mutate(
days_left = pmax(0, end - pmax(start, date)),
premium_left = days_left * daily_premium
) %>%
select(policy_id, date, days_left, premium_left)
#> # A tibble: 9 x 4
#> policy_id date days_left premium_left
#> <int> <date> <dbl> <dbl>
#> 1 1 2021-12-31 69 5421.
#> 2 1 2022-06-30 0 0
#> 3 1 2022-12-31 0 0
#> 4 2 2021-12-31 193 4849.
#> 5 2 2022-06-30 12 302.
#> 6 2 2022-12-31 0 0
#> 7 3 2021-12-31 280 24500
#> 8 3 2022-06-30 230 20125
#> 9 3 2022-12-31 46 4025
