count row if date falls within date range for all dates in series in R-CodePudding

I have a large data frame (~30,000 rows) where I have two date fields "start_date" and "end_date".

I want to summarise the data such that I have 1 column with all the dates and a second column with a count of all the rows in which that date is between the "start_date" and "end_date".

I can make this work using 2 for loops but it is very inefficient as it is going by one though one comparing about 180 dates to 30,000 rows of date ranges.

Below is an example. Say I have the following dataframe.

df <- tibble(
  start_date = c(1,1,2,2,3,3,4,4,5,5),
  end_date = c(2,3,4,5,6,7,8,9,10,11)
)

I want this to output a table/dataframe that looks like this

Date    Count
1       2
2       4
3       5
4       6
5       7
6       6
7       5
8       4
9       3
10      2
11      1

Is there some TidyVerse functions or anything else that could do this transformation efficiently?

CodePudding user response：

Here's a base R method:

date = seq(min(df$start_date), max(df$end_date))
count = sapply(date, \(x) sum(x >= df$start_date & x <= df$end_date))
data.frame(date, count)
#    date count
# 1     1     2
# 2     2     4
# 3     3     5
# 4     4     6
# 5     5     7
# 6     6     6
# 7     7     5
# 8     8     4
# 9     9     3
# 10   10     2
# 11   11     1

CodePudding user response：

Here is a data.table approach using foverlaps. First, create a sequence of desired dates from the minimum start_date to the maximum end_date. Then, create a simple data.table for each of these dates.

Use foverlaps to get the overlapping join between your starting data.frame and the new table. Finally, count up the number of rows after the join for each date.

library(data.table)

setDT(df)
dates <- seq(min(df$start_date), max(df$end_date), by = 1)
dt <- data.table(start_date = dates, end_date = dates, key = c("start_date", "end_date"))
foverlaps(df, dt, which = T)[, .N, by = yid]

Output

In tidyverse you could adapt to the following:

library(tidyverse)

data.frame(date = seq(min(df$start_date), max(df$end_date), by = 1)) %>%
  rowwise() %>%
  mutate(count = sum(date >= df$start_date & date <= df$end_date))