How to randomly sample multiple consecutive rows of a dataframe in R?-CodePudding

I've a dataframe with 100 rows and 20 columns and want to randomly sample 5 times 10 consecutive rows, e.g. 10:19, 25:34, etc. With: sample_n( df, 5 ) I'm able to extract 5 unique, randomly sampled rows, but don't know how to sample consecutive rows. Any help? Thanks!

CodePudding user response：

df <- mtcars
df$row_nm <- seq(nrow(df))

set.seed(7)

sample_seq <- function(n, N) {
  i <- sample(seq(N), size = 1)
  
  ifelse(
    test = i   (seq(n) - 1) <= N,
    yes = i   (seq(n) - 1),
    no = i   (seq(n) - 1) - N
  )
}

replica <- replicate(n = 5, sample_seq(n = 10, N = nrow(df)))

# result
lapply(seq(ncol(replica)), function(x) df[replica[, x], ])
#> [[1]]
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb row_nm
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     10
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     11
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3     12
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3     13
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3     14
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4     15
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4     16
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4     17
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1     18
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2     19
#> 
#> [[2]]
#>                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb row_nm
#> Honda Civic      30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2     19
#> Toyota Corolla   33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1     20
#> Toyota Corona    21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1     21
#> Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2     22
#> AMC Javelin      15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2     23
#> Camaro Z28       13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4     24
#> Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2     25
#> Fiat X1-9        27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1     26
#> Porsche 914-2    26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     27
#> Lotus Europa     30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     28
#> 
#> [[3]]
#>                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb row_nm
#> Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8     31
#> Volvo 142E        21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     32
#> Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      1
#> Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4      2
#> Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1      3
#> Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1      4
#> Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2      5
#> Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      6
#> Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4      7
#> Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      8
#> 
#> [[4]]
#>                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb row_nm
#> Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     28
#> Ford Pantera L    15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4     29
#> Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     30
#> Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8     31
#> Volvo 142E        21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     32
#> Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      1
#> Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4      2
#> Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1      3
#> Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1      4
#> Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2      5
#> 
#> [[5]]
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb row_nm
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4      7
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      8
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2      9
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     10
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     11
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3     12
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3     13
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3     14
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4     15
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4     16

^{Created on 2022-01-24 by the reprex package (v2.0.1)}

CodePudding user response：

You could so something like:

#sample data
df <- data.table(value = 1:100000)

#function which sampled consecutive rows (x = dataframe, rows = nr of consecutive rows, nr = amount of times you want to sample consecutive rows)
sample_fun <- function(x, rows, nr){
  #maximum number which can be sampled
  numbers <- 1:(nrow(x) - rows)

  #randomly sample 5 numbers
  sampled.numbers <- sample(numbers, nr)
    
  #convert to vector (5 consecutive)
  sampled.rows <- lapply(sampled.numbers, function(x){seq(x, (x rows-1), 1)})
  sampled.rows <- do.call(c, sampled.rows)
  
  #sample and return
  result <- x[sampled.rows,]
  return(result)
}  

sample_fun(x = df, rows = 5, nr = 2)

CodePudding user response：

You don't mention if this can include replacement (i.e. if you sample 10:19, can you then also sample 15:24?). You also don't mention if you can sample anything above row 91, which would mean that sample of 10 gets cut off (i.e. 98,99,100 would only be 3 consecutive rows unless you want that to loop back to row 1). Assuming you can sample any value with replacement, the solution can be done in one line:

sapply(sample(1:100,5),function(x){seq(x,x 9)})

This applies the sequence function to each of 5 individually sampled numbers. The output will be a matrix, where each column is a sample of 10 consecutive rows, but as noted, these will potentially overlap, or go above 100.

If you want a solution where the rows will not overlap at all, and avoiding values over 100, without making values above 91 less likely to be sampled, this actually gets kind of trick, but I think the code below should work. You cant just sample from 1:91 without affect probability of your random sample, because then this means a value like 100 actually only has a 1/91 probability of being sampled (the sample value has to be 91), where other values don't involve that same constraint. This solution makes it so all rows are equally likely to be sampled.

Rows=c(1:100,1:100)


SampleRows=matrix(0,nrow=10,ncol=5)


for(i in 1:ncol(SampleRows)){
  SampledValue=sample(Rows,1)
  RowsIndex=min(which(Rows==SampledValue))
  Sequence=Rows[RowsIndex:(RowsIndex 9)]
  SampleRows[,i]=Sequence
  Rows=Rows[!(Rows %in% Sequence)]
}

This approach creates a vector that sequences from 1:100, repeated twice (variable Rows), you'll see why this is important in a bit. For each of 5 iterations (corresponding to 5 samples), we take a sampled value from Rows, which will be a number 1:100, we then figure out where that number is in Rows, and take all 9 values next to it. In the first sample this will always be 10 consecutive numbers (e.g. 20:29). But then we remove those sampled values from Rows. If we happen to get the next sample as a value that would lead to overlap (like 18), then instead it samples (18,19,30,31,32,33,34...) since 20:29 have been removed. We need to do 1:100 twice in Rows, so that if we sample a value like 99, it resets from 100, back to 1.

If you want your output in a vector,throw in this at the end

sort(as.vector(SampleRows))

Let me know if this works for the needs of your problem.