I have a dataframe of time series data, df1, from which I need to extract a number of 'windows' in R. The start and end points of the windows are in two columns of a separate dataframe, df2, and correspond to row numbers in df1.
The example below gets me part of the way to a solution, but currently only the first window is extracted. How do I amend it to extract all four windows? Could this be a case for purrr?
library(tidyverse)
# dataframe of data to subset
df1 <- tibble(my_values = rnorm(100))
# dataframe of windows (i.e. row number IDs) to extract from data
df2 <-tibble::tribble(
~window_start, ~window_end,
3L, 10L,
21L, 25L,
52L, 63L,
78L, 90L
)
# extracted data
df3 <- df1 %>%
  slice(df2$window_start : df2$window_end)
(NB. I am aware there is a similar question here - Subset a dataframe using start and stop points from another dataframe? - but my actual data is very large and I am curious whether a non-merge-based solution will be quicker.)
CodePudding user response:
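Maybe try this approach with purrr::map2. Note that, as written, it keeps the rows whose my_values fall between each start and end value, rather than selecting by row number: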
# dataframe of data to subset
df1 <- tibble(my_values = rnorm(100, mean = 45, sd = 30) %>% abs())
# dataframe of windows (i.e. row number IDs) to extract from data
df2 <-tibble::tribble(
~window_start, ~window_end,
3L, 10L,
21L, 25L,
52L, 63L,
78L, 90L
)
subset_thats_in <- function(mini, maxi){
  df1 %>%
    filter(between(my_values, mini, maxi))
}
purrr::map2(df2$window_start,
            df2$window_end,
            subset_thats_in)
[[1]]
# A tibble: 4 × 1
my_values
<dbl>
1 6.47
2 8.69
3 7.73
4 7.35
[[2]]
# A tibble: 12 × 1
my_values
<dbl>
1 24.2
2 22.9
3 22.4
4 24.4
5 22.6
6 21.7
7 23.2
8 21.3
9 23.3
10 21.1
11 23.5
12 22.6
[[3]]
# A tibble: 10 × 1
my_values
<dbl>
1 54.0
2 61.4
3 62.5
4 60.8
5 60.5
6 55.5
7 61.4
8 59.0
9 57.9
10 53.3
[[4]]
# A tibble: 6 × 1
my_values
<dbl>
1 87.8
2 79.1
3 80.5
4 82.7
5 85.2
6 80.6
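If the windows are meant as row numbers, as in the question, the same map2 pattern can slice by position instead of filtering by value. The following is only a minimal sketch under that assumption; slice_window is a hypothetical helper name, and the seed is arbitrary.

```r
library(tibble)
library(dplyr)
library(purrr)

set.seed(1)  # arbitrary seed, only so the example is reproducible
df1 <- tibble(my_values = rnorm(100))
df2 <- tibble(
  window_start = c(3L, 21L, 52L, 78L),
  window_end   = c(10L, 25L, 63L, 90L)
)

# Hypothetical helper: take rows first..last of df1 by position
slice_window <- function(first, last) {
  slice(df1, first:last)
}

# One tibble per window, returned as a list
windows <- map2(df2$window_start, df2$window_end, slice_window)

# Each window has window_end - window_start + 1 rows
map_int(windows, nrow)
```

Here the result is a list of four tibbles with 8, 5, 12 and 13 rows, one per row of df2.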
CodePudding user response:
We can use map2
library(tidyverse)
map2(df2[[1]], df2[[2]], ~ df1[.x:.y, ])
#> [[1]]
#> # A tibble: 8 × 1
#> my_values
#> <dbl>
#> 1 1.33
#> 2 1.27
#> 3 0.415
#> 4 -1.54
#> 5 -0.929
#> 6 -0.295
#> 7 -0.00577
#> 8 2.40
#>
#> [[2]]
#> # A tibble: 5 × 1
#> my_values
#> <dbl>
#> 1 -0.224
#> 2 0.377
#> 3 0.133
#> 4 0.804
#> 5 -0.0571
#>
#> [[3]]
#> # A tibble: 12 × 1
#> my_values
#> <dbl>
#> 1 -0.377
#> 2 2.44
#> 3 -0.795
#> 4 -0.0549
#> 5 0.250
#> 6 0.618
#> 7 -0.173
#> 8 -2.22
#> 9 -1.26
#> 10 0.359
#> 11 -0.0110
#> 12 -0.941
#>
#> [[4]]
#> # A tibble: 13 × 1
#> my_values
#> <dbl>
#> 1 -0.118
#> 2 -0.912
#> 3 -1.44
#> 4 -0.797
#> 5 1.25
#> 6 0.772
#> 7 -0.220
#> 8 -0.425
#> 9 -0.419
#> 10 0.997
#> 11 -0.276
#> 12 1.26
#> 13 0.647
Or, more creatively, with dplyr:
df2 %>%
  rowwise() %>%
  transmute(windows = list(c_across(starts_with("window")) %>% {df1[.[[1]]:.[[2]], ]}))
#> # A tibble: 4 × 1
#> # Rowwise:
#> windows
#> <list>
#> 1 <tibble [8 × 1]>
#> 2 <tibble [5 × 1]>
#> 3 <tibble [12 × 1]>
#> 4 <tibble [13 × 1]>
Created on 2022-01-09 by the reprex package (v2.0.1)
data:
set.seed(0)
# dataframe of data to subset
df1 <- tibble(my_values = rnorm(100))
# dataframe of windows (i.e. row number IDs) to extract from data
df2 <- tibble::tribble(
~window_start, ~window_end,
3L, 10L,
21L, 25L,
52L, 63L,
78L, 90L
)
CodePudding user response:
You could use mapply:
df1[unlist(mapply(function(x,y) x:y, df2$window_start, df2$window_end)),]
# A tibble: 38 x 1
my_values
<dbl>
1 0.671
2 -0.617
3 -0.354
4 2.76
5 0.382
6 -0.488
7 0.889
8 -1.32
9 0.328
10 0.779
# ... with 28 more rows
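Because this approach returns one combined data frame, the information about which window each row came from is lost. A small sketch of one way to keep it, assuming a new column named window is acceptable (the column name is my own choice, not part of the answer above). It uses Map rather than mapply because Map always returns a list, whereas mapply simplifies to a matrix by default when every window happens to have the same length.

```r
library(tibble)

set.seed(1)  # arbitrary seed for reproducibility
df1 <- tibble(my_values = rnorm(100))
df2 <- tibble(
  window_start = c(3L, 21L, 52L, 78L),
  window_end   = c(10L, 25L, 63L, 90L)
)

# List of integer row indices, one element per window
idx <- Map(seq.int, df2$window_start, df2$window_end)

# Subset once with all indices, then label each row with its window number
out <- df1[unlist(idx), ]
out$window <- rep(seq_along(idx), lengths(idx))

table(out$window)
```

The combined result has 38 rows, and the window column can be used later with split() or group_by() if the windows need to be processed separately.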
CodePudding user response:
purrr is quite memory efficient for such data transformations. However, a list of length 10000 may still be beefy if data is to be copied.
x = vector(mode = "list", 10000L)
x = purrr::transpose(df2) |> lapply(function(x) { r = as.numeric(x); df1[r[1]:r[2], ] })
as.numeric on each element of the transposed list gets that window's start and end, which give the range used to subset df1.
For larger sets, it might be useful to try a vectorized approach. Below is a base R option; set SIMPLIFY = TRUE to reduce the result to a vector if you are working with a single column.
f = Vectorize(\(x, y) df1[seq.int(x, y),], SIMPLIFY = FALSE)
f(df2[[1]], df2[[2]])
CodePudding user response:
A simple base R solution is to use the sequence function to generate all the row indices you need, and then subset the dataframe once. The first argument of sequence gives the length of each sequence to generate, and the second argument gives the number each sequence starts from. This should be very efficient, since the only thing the function does is create a sequence of integers.
df1[sequence(df2$window_end - df2$window_start + 1L, df2$window_start), ]
Output
> set.seed(1234L)
> df1 <- tibble(my_values = rnorm(100))
> df1[sequence(df2$window_end - df2$window_start + 1L, df2$window_start), ]
# A tibble: 38 x 1
my_values
<dbl>
1 1.08
2 -2.35
3 0.429
4 0.506
5 -0.575
6 -0.547
7 -0.564
8 -0.890
9 0.134
10 -0.491
# ... with 28 more rows
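To see how sequence builds the indices, a tiny worked example may help. This is just an illustrative sketch: the variable names are my own, and the two-argument form of sequence (with from) assumes R 4.0.0 or later.

```r
# Two sequences: length 3 starting at 5, and length 2 starting at 10
nvec <- c(3L, 2L)
from <- c(5L, 10L)
small <- sequence(nvec, from)
small
# 5 6 7 10 11

# The same idea applied to the windows from the question
window_start <- c(3L, 21L, 52L, 78L)
window_end   <- c(10L, 25L, 63L, 90L)

# Window lengths are end - start + 1 (both endpoints are included)
idx <- sequence(window_end - window_start + 1L, window_start)

# idx matches the concatenation 3:10, 21:25, 52:63, 78:90
all(idx == c(3:10, 21:25, 52:63, 78:90))
# TRUE
```

The + 1L matters because both the start and the end row are included in each window; without it every window would be one row short.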
You can also get a minor performance improvement by using data.table:::vecseq. The code is very similar to the above:
df1[data.table:::vecseq(df2$window_start, df2$window_end - df2$window_start + 1L, NULL), ]
Further reading: data.table efficient recycling V2
