I am not sure if this task is of self join or not. I am basically trying to lookup the latest date for each State.UnionTerritory in below dataframe where the Daily_confirmed cases for each of them were <= half of current Date.
This will help me to get the the doubling time of cases for each State on each date.
library(tidyverse)
library(lubridate)
df_ind <- read_csv("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/sample_data.csv")
df_ind %>% head()
# output
Date State.UnionTerritory Daily_confirmed
<date> <chr> <dbl>
1 2021-12-23 Haryana 46
2 2021-12-23 Maharashtra 1179
3 2021-12-23 Delhi 118
4 2021-12-22 Haryana 55
5 2021-12-22 Maharashtra 1201
6 2021-12-22 Delhi 125
For example Delhi has 118 Cases on 2021-12-23 and less than or half of this for Delhi is coming as 57 on 2021-12-15 so doubling rate would be 2021-12-23 - 2021-12-15 = 8 days.
so I should get something like:
This should be applied for each State in the data & on all dates.
df_ind <- df_ind %>%
mutate(Daily_confirmed_half = as.integer(Daily_confirmed / 2) )
I am not sure how exactly I can try this to get proper Dates as the Doubling_Date column where Daily_confirmed Cases values met the condition of <= half.
I can group summarize & use first to pull the latest dates but not sure what would be the efficient way of bringing the right dates in another column of this data frame.
CodePudding user response:
I'm not sure if this is your desired output. The approach uses a full_join together with filter to simulate an non-euqal join in dplyr. Then we do some data cleaning and in the last step we need a left_join to our original data, since there are quite a couple of days, where we cannot calculate the doubling date, since its not included in the time series.
library(tidyverse)
library(lubridate)
df_ind %>%
group_by(State.UnionTerritory) %>%
full_join(., ., by = c("State.UnionTerritory")) %>%
filter(Date.x > Date.y,
Daily_confirmed.x > (Daily_confirmed.y * 2)) %>%
group_by(Date.x, State.UnionTerritory) %>%
filter(Date.y == max(Date.y)) %>%
filter(Daily_confirmed.y == max(Daily_confirmed.y)) %>%
rename("Date" = Date.x,
"Daily_confirmed" = Daily_confirmed.x,
"Doubling_Date" = Date.y) %>%
select(- Daily_confirmed.y) %>%
mutate(Day_to_double = Date - Doubling_Date) %>%
left_join(df_ind,
.,
by = c("Date", "State.UnionTerritory", "Daily_confirmed")) %>%
arrange(State.UnionTerritory, desc(Date))
#> # A tibble: 252 x 5
#> Date State.UnionTerritory Daily_confirmed Doubling_Date Day_to_double
#> <date> <chr> <dbl> <date> <drtn>
#> 1 2021-12-23 Delhi 118 2021-12-15 8 days
#> 2 2021-12-22 Delhi 125 2021-12-15 7 days
#> 3 2021-12-21 Delhi 102 2021-12-14 7 days
#> 4 2021-12-20 Delhi 91 2021-12-14 6 days
#> 5 2021-12-19 Delhi 107 2021-12-14 5 days
#> 6 2021-12-18 Delhi 86 2021-12-13 5 days
#> 7 2021-12-17 Delhi 69 2021-12-13 4 days
#> 8 2021-12-16 Delhi 85 2021-12-13 3 days
#> 9 2021-12-15 Delhi 57 2021-11-27 18 days
#> 10 2021-12-14 Delhi 45 2021-11-15 29 days
#> # … with 242 more rows
Created on 2021-12-25 by the reprex package (v0.3.0)

