iterate which function over vector of numbers-CodePudding

I have two vectors of numbers. I would like to find the first position in the larger vector that is greater than each of the elements in the smaller vector. This is achievable with the which function and a loop but I was hoping to achieve this with one simple line of code.

example

small_vector <- c(555.8, 457.4, 392.3, 294.24, 228.80, 163.44, 98.08)

large_vector <- c(20.1, 14.8, 23.3, 38.1, 40.2, 48.7, 48.7 , 54.0 ,60.5 ,64.6 , 85.8, 93.1, 99.5, 115.4, 133.4, 158.8, 206.4, 289.1, 353.7, 428.9, 558.0, 661.8)

which.max(small_vector <= large_vector)

I would like the return to be something like:

[1] 21 21 20 19 18 17 13

CodePudding user response：

sapply(small_vector, function(S) which.max(S <= large_vector))
# [1] 21 21 20 19 18 17 13

CodePudding user response：

You could also use outer with max.col:

max.col(outer(small_vector, large_vector, '<='), 'first')
[1] 21 21 20 19 18 17 13

EDIT:

the solutions above are not feasible when the vectors are large. Hence the need for edit. THE feasible solution will be using the data.table frank as it is faster than rank from base R according to the documentation:

library(data.table)
n <- seq_along(small_vector)
x <- frank(c(small_vector, large_vector), ties.method = 'first')[n]
y <- frank(small_vector, ties.method = 'first') - 1
x - y
[1] 21 21 20 19 18 17 13

benchmarking the above:

f_sapply <-  function(){
  sapply(small_vector, function(S) which.max(S <= large_vector))
}

f_datatable_rank <- function(){
  n <- seq_along(small_vector)
  x <- frank(c(small_vector, large_vector), ties.method = 'first')[n]
  y <- frank(small_vector, ties.method = 'first') - 1
  x - y
}

f_baseR_rank <- function(){
  n <- seq_along(small_vector)
  x <- rank(c(small_vector, large_vector), ties.method = 'first')[n]
  y <- rank(small_vector, ties.method = 'first') - 1
  x - y
}

f_outer <- function(){
  max.col(outer(small_vector, large_vector, '<='), 'first')
}


a <- microbenchmark(f_sapply(), f_outer(), f_baseR_rank(), f_datatable_rank(),
                    times = 10,
                    check = 'equal')
Unit: microseconds
               expr        min         lq        mean     median         uq        max neval
         f_sapply()  19654.007  21044.502  27836.2294  30436.816  31303.824  32598.046    10
          f_outer() 188230.644 194679.142 197723.3888 196696.690 199794.034 210718.518    10
     f_baseR_rank()    451.830    469.334    544.6752    540.263    550.291    792.433    10
 f_datatable_rank()   1081.983   1335.431   1513.4631   1509.925   1772.672   1834.301    10

If you increase the vectors, the benchmark will fail because of the outer and sapply but rank will work

CodePudding user response：

This should scale well with larger vectors:

(which(order(c(small_vector, cummax(large_vector))) <= length(small_vector)) - 0:(length(small_vector) - 1))[rank(small_vector)]

#> [1] 21 21 20 19 18 17 13

Some benchmarking:

set.seed(3)
library(data.table)
small_vector <- runif(700)
large_vector <- runif(2200)

microbenchmark::microbenchmark(sapply1 = sapply(small_vector, function(S) which.max(S <= large_vector)),
                               outer = max.col(outer(small_vector, large_vector, '<='), 'first'),
                               rank = {large_vector <- cummax(large_vector); sapply(small_vector, function(S) rank(c(S, large_vector), ties.method = "first")[1])},
                               frank = frank(c(small_vector, cummax(large_vector)), ties.method = 'first')[seq_along(small_vector)] - frank(small_vector, ties.method = 'first')   1L,
                               order = (which(order(c(small_vector, cummax(large_vector))) <= length(small_vector)) - 0:(length(small_vector) - 1))[rank(small_vector)])
#> Unit: microseconds
#>     expr      min        lq       mean    median        uq      max neval
#>  sapply1   9651.7  10174.70  11380.203  10540.65  11425.35  25423.2   100
#>    outer  46033.0  50234.80  55730.100  53168.30  56102.05  96617.0   100
#>     rank 201767.7 208193.10 214184.486 214575.85 217909.25 244915.6   100
#>    frank   1469.9   1784.35   1982.927   1946.75   2122.35   4260.7   100
#>    order    358.7    377.05    423.464    411.00    452.95    668.9   100

A side note: I noticed that the sapply and outer solutions incorrectly return 1 for elements in small_vector that are larger than all elements in large_vector, whereas the rank/order solutions will return 1 length(large_vector) (which is straightforward to interpret correctly). Of course, if it's not a possibility, all approaches in the benchmark work.