How to look up date value from same table based on a condition in R?-CodePudding

I am not sure if this task is of self join or not. I am basically trying to lookup the latest date for each State.UnionTerritory in below dataframe where the Daily_confirmed cases for each of them were <= half of current Date.

This will help me to get the the doubling time of cases for each State on each date.

library(tidyverse)
library(lubridate)

df_ind <- read_csv("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/sample_data.csv")

df_ind %>% head()

# output
Date       State.UnionTerritory Daily_confirmed
  <date>     <chr>                          <dbl>
1 2021-12-23 Haryana                           46
2 2021-12-23 Maharashtra                     1179
3 2021-12-23 Delhi                            118
4 2021-12-22 Haryana                           55
5 2021-12-22 Maharashtra                     1201
6 2021-12-22 Delhi                            125

For example Delhi has 118 Cases on 2021-12-23 and less than or half of this for Delhi is coming as 57 on 2021-12-15 so doubling rate would be 2021-12-23 - 2021-12-15 = 8 days.

so I should get something like:

This should be applied for each State in the data & on all dates.

df_ind <- df_ind %>%
  mutate(Daily_confirmed_half = as.integer(Daily_confirmed / 2) )

I am not sure how exactly I can try this to get proper Dates as the Doubling_Date column where Daily_confirmed Cases values met the condition of <= half.

I can group summarize & use first to pull the latest dates but not sure what would be the efficient way of bringing the right dates in another column of this data frame.

CodePudding user response：

I'm not sure if this is your desired output. The approach uses a full_join together with filter to simulate an non-euqal join in dplyr. Then we do some data cleaning and in the last step we need a left_join to our original data, since there are quite a couple of days, where we cannot calculate the doubling date, since its not included in the time series.

library(tidyverse)
library(lubridate)

df_ind %>% 
  group_by(State.UnionTerritory) %>% 
  full_join(., ., by = c("State.UnionTerritory")) %>%
  filter(Date.x > Date.y,
         Daily_confirmed.x > (Daily_confirmed.y * 2)) %>% 
  group_by(Date.x, State.UnionTerritory) %>% 
  filter(Date.y == max(Date.y)) %>% 
  filter(Daily_confirmed.y == max(Daily_confirmed.y)) %>% 
  rename("Date" = Date.x,
         "Daily_confirmed" = Daily_confirmed.x,
         "Doubling_Date" = Date.y) %>% 
  select(- Daily_confirmed.y) %>% 
  mutate(Day_to_double = Date - Doubling_Date) %>% 
  left_join(df_ind,
            .,
            by = c("Date", "State.UnionTerritory", "Daily_confirmed")) %>% 
  arrange(State.UnionTerritory, desc(Date)) 

#> # A tibble: 252 x 5
#>    Date       State.UnionTerritory Daily_confirmed Doubling_Date Day_to_double
#>    <date>     <chr>                          <dbl> <date>        <drtn>       
#>  1 2021-12-23 Delhi                            118 2021-12-15     8 days      
#>  2 2021-12-22 Delhi                            125 2021-12-15     7 days      
#>  3 2021-12-21 Delhi                            102 2021-12-14     7 days      
#>  4 2021-12-20 Delhi                             91 2021-12-14     6 days      
#>  5 2021-12-19 Delhi                            107 2021-12-14     5 days      
#>  6 2021-12-18 Delhi                             86 2021-12-13     5 days      
#>  7 2021-12-17 Delhi                             69 2021-12-13     4 days      
#>  8 2021-12-16 Delhi                             85 2021-12-13     3 days      
#>  9 2021-12-15 Delhi                             57 2021-11-27    18 days      
#> 10 2021-12-14 Delhi                             45 2021-11-15    29 days      
#> # … with 242 more rows

^{Created on 2021-12-25 by the reprex package (v0.3.0)}