Home > Back-end >  How to bootstrap dataset in R which is blocked by a factor?
How to bootstrap dataset in R which is blocked by a factor?

Time:01-21

I want to perform bootstrap on this data set. Notice that the data has two factors: replicate and level, and two variables high.density and low.density that need to be regressed. I want to perform a bootstrap on this data-set but the replacements can occur only within the nested factor of replicate and level.

replicate level high.density low.density
    1     low    14          36
    1     low    54          31
    1     mid    82          10
    1     mid    24          NA
    2     low    12          28
    2     low    11          45
    2     mid    12          17
    2     mid    NA          24
    2      up    40          10
    2      up    NA           5
    2      up    20           2

For instance, in replicate/ level: 1/low the low.density 31 and 36 can be interchanged (or high.density interchanged) so the head of that dataset may look like:

replicate level high.density low.density
    1     low    14          31
    1     low    54          36
    1     mid    82          10
    1     mid    24          NA

I then want to estimate the linear regression (glm) from this dataset. I would appreciate any feedback on trying to achieve this.

##DATA FRAME (credits: caldwellst)

    df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10, 
    NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
    
    df$replicate <- as.factor(as.numeric(df$replicate))
    df$level <- as.factor(as.character(df$level)

)

CodePudding user response:

Here's a solution using dplyr, purrr, and tidyr. First nest the numeric data, and then sample each of the unique combinations of replicate and level in the data. Then within those, bootstrap the unique values of the densities and then unnest for final data frame.

# library(tidyverse)
library(dplyr)
library(tidyr)
library(purrr)

df %>%
  nest(data = ends_with("density")) %>%
  slice_sample(n = 500, replace = TRUE) %>%
  mutate(data = map(data, ~summarize(.x, across(.fns = sample, size = 1)))) %>%
  unnest(cols = data)
#> # A tibble: 500 × 4
#>    replicate level high.density low.density
#>        <dbl> <chr>        <dbl>       <dbl>
#>  1         1 low             54          31
#>  2         2 mid             12          24
#>  3         1 mid             24          10
#>  4         2 up              20           2
#>  5         2 mid             12          24
#>  6         2 mid             12          24
#>  7         1 mid             82          10
#>  8         2 up              NA           2
#>  9         1 low             14          36
#> 10         2 mid             12          17
#> # … with 490 more rows

Data

df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), 
    level = c("low", "low", "mid", "mid", "low", "low", "mid", 
    "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 
    12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10, 
    NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(
    cols = list(replicate = structure(list(), class = c("collector_double", 
    "collector")), level = structure(list(), class = c("collector_character", 
    "collector")), high.density = structure(list(), class = c("collector_double", 
    "collector")), low.density = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))

CodePudding user response:

We may exploit split and do the sampling according to unique combinations of replicate and level. We could repeat this process B times.

df_shuffle <- function(DF) {
  my_split <- split(DF, f = ~ DF$replicate   DF$level)
  shuffle <- lapply(my_split, \(x) {
    nrX <- nrow(x)
    cbind(x[, c('replicate', 'level')],
          high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
          low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
  })
  DF_new <- do.call(rbind, shuffle)
  rownames(DF_new) <- NULL
  return(DF_new)
}

B <- 1000L
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
# ---------------------------------------------------
> df_list[[B]]
   replicate level high.density low.density
1          1   low           54          36
2          1   low           54          36
3          2   low           12          45
4          2   low           12          28
5          1   mid           24          10
6          1   mid           82          10
7          2   mid           NA          17
8          2   mid           12          17
9          2    up           20          10
10         2    up           40          10
11         2    up           20           5

Because the original data contains missing observations, we either have to multiply impute them or opt to lisewise delete them. For now, let's perform the latter option.

# listwise delete missing observations
df_list <- lapply(df_list, function(x) x[complete.cases(x), ])

Finally, we perform a linear regression on each shuffled dataset and store the B coefficients in out.

row_bind <- function(x) data.frame(do.call(rbind, x))
out <- row_bind(
  lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
)

## out <- row_bind(
##   lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
##                            family = binomial())$coef)
## )

# -------------------------------------------------------------------
> dim(out)
[1] 1000    2

Output

> head(out)
  X.Intercept. low.density
1     13.74881   0.2804738
2     20.01074  -0.2095672
3     30.26643  -0.2946373
4     29.19541  -0.2752761
5     37.76273  -0.4555651
6     37.72250  -0.1548349

enter image description here

The code required to create this image can be found here.

  •  Tags:  
  • Related