Home > Software engineering >  R: Loops, Dplyr and lubridate, how to combine them
R: Loops, Dplyr and lubridate, how to combine them

Time:01-27

I'm new to R and I'm facing a problem, I have a date vector and a dataframe containing data regarding sales values and coverage start and end dates.

I need to defer the sale value at each analysis date, for the first analysis period, I can create an algorithm that gives me the desired answer. However in my real data I am working with a base of 200K rows and 50 analysis periods.

I'm not able to build a loop or find an alternative function in R that allows me to create the variables Aux[i] and Test[i] according to the number of dates present in the vec_date vector.

The following is an example of code that works for the first analysis period.

library(tidyverse)
library(lubridate)

df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
             DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
             Premium = c(11000, 5000, 24500))

date <- ymd("2021-12-31")
vec_date <- date %m % months(seq(0, 12, by = 6))

df_new <- df |> 
  mutate(duration = as.numeric(DateFin - DateIn),
         Pr_day = Premium/duration,
         Aux1 = if_else(DateIn > vec_date[1] | DateFin < vec_date[1], "N", "Y"),
         test1 =  if_else(Aux1 == "Y" & DateFin > vec_date[1], as.numeric(DateFin - vec_date[1])*Pr_day,
                         if_else(DateIn > vec_date[1], Premium, 0))) 

Does anyone have any idea how I could build this loop, or is there any R function/package that allows me to perform this interaction between my df dataframe and vec_date vector?

Edit: an outline of the format you would need as a result would be:

df_final <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
             DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
             Premium = c(11000, 5000, 24500),
             Aux1 = c("Y", "Y", "N"),
             test1 = c(5421.429, 4849.246, 24500.000),
             Aux2 = c("N", "Y", "Y"),
             test2 = c(0.0000, 301.5075, 20125.0000),
             Aux3 = c("N", "N", "Y"),
             test3 = c(0, 0, 4025))

Where, Aux1 and test1 are the results referring to vec_date[1], 2 = vec_date[2], 3 = vec_date[3]. For me it is important to keep the resulting variables in the same dataframe because later analysis will be done.

CodePudding user response:

I'm having trouble figuring out what your intended outcome is supposed to be; is this example heading in the right direction?

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
             DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
             Premium = c(11000, 5000, 24500))

date <- ymd("2021-12-31")
vec_date <- date %m % months(seq(0, 12, by = 6))

test_func <- function(ind){
  df_new <- df |>
    mutate(duration = as.numeric(DateFin - DateIn),
           Pr_day = Premium/duration,
           Aux1 = if_else(DateIn > vec_date[ind] | DateFin < vec_date[ind], "N", "Y"),
           test1 =  if_else(Aux1 == "Y" & DateFin > vec_date[ind], as.numeric(DateFin - vec_date[ind])*Pr_day,
                            if_else(DateIn > vec_date[ind], Premium, 0))) |>
    filter(Aux1 == "Y")
}

list_of_results <- lapply(seq_along(vec_date), test_func)
names(list_of_results) <- vec_date
list_of_results
#> $`2021-12-31`
#> # A tibble: 2 × 7
#>   DateIn     DateFin    Premium duration Pr_day Aux1  test1
#>   <date>     <date>       <dbl>    <dbl>  <dbl> <chr> <dbl>
#> 1 2021-10-21 2022-03-10   11000      140   78.6 Y     5421.
#> 2 2021-12-25 2022-07-12    5000      199   25.1 Y     4849.
#> 
#> $`2022-06-30`
#> # A tibble: 2 × 7
#>   DateIn     DateFin    Premium duration Pr_day Aux1   test1
#>   <date>     <date>       <dbl>    <dbl>  <dbl> <chr>  <dbl>
#> 1 2021-12-25 2022-07-12    5000      199   25.1 Y       302.
#> 2 2022-05-11 2023-02-15   24500      280   87.5 Y     20125 
#> 
#> $`2022-12-31`
#> # A tibble: 1 × 7
#>   DateIn     DateFin    Premium duration Pr_day Aux1  test1
#>   <date>     <date>       <dbl>    <dbl>  <dbl> <chr> <dbl>
#> 1 2022-05-11 2023-02-15   24500      280   87.5 Y      4025

Created on 2022-01-27 by the reprex package (v2.0.1)

CodePudding user response:

As @Jon Spring suggests in the comments, probably the preferred approach here would be to use tidyr::complete() to extend your data frame, repeating each row in it for each of your analysis dates. Then, you can stick to vectorized calculations and get the analysis date column in the resulting data, too.

Below is how to do just that with the example data you provided. I took the liberty of renaming some columns, and simplifying the control-flow based calculation according to my understanding of the problem, based on what you shared.

First, the example data slightly reframed:

library(tidyverse)
library(lubridate)

policies <- tibble(
  policy_id = seq_len(3),
  start = ymd("2021-10-21", "2021-12-25", "2022-05-11"),
  end = ymd("2022-03-10", "2022-07-12", "2023-02-15"),
  premium = c(11000, 5000, 24500)
)

policies
#> # A tibble: 3 x 4
#>   policy_id start      end        premium
#>       <int> <date>     <date>       <dbl>
#> 1         1 2021-10-21 2022-03-10   11000
#> 2         2 2021-12-25 2022-07-12    5000
#> 3         3 2022-05-11 2023-02-15   24500

Then, finding remaining prorated premiums for policies at given dates:

start_date <- ymd("2021-12-31")
dates <- start_date %m % months(seq(0, 12, by = 6))

policies %>% 
  mutate(
    days = as.numeric(end - start),
    daily_premium = premium / days
  ) %>% 
  crossing(date = dates) %>% 
  mutate(
    days_left = pmax(0, end - pmax(start, date)),
    premium_left = days_left * daily_premium
  ) %>% 
  select(policy_id, date, days_left, premium_left)
#> # A tibble: 9 x 4
#>   policy_id date       days_left premium_left
#>       <int> <date>         <dbl>        <dbl>
#> 1         1 2021-12-31        69        5421.
#> 2         1 2022-06-30         0           0 
#> 3         1 2022-12-31         0           0 
#> 4         2 2021-12-31       193        4849.
#> 5         2 2022-06-30        12         302.
#> 6         2 2022-12-31         0           0 
#> 7         3 2021-12-31       280       24500 
#> 8         3 2022-06-30       230       20125 
#> 9         3 2022-12-31        46        4025
  •  Tags:  
  • Related