I have two vectors of numbers. I would like to find the first position in the larger vector that is greater than each of the elements in the smaller vector. This is achievable with the which function and a loop but I was hoping to achieve this with one simple line of code.
example
small_vector <- c(555.8, 457.4, 392.3, 294.24, 228.80, 163.44, 98.08)
large_vector <- c(20.1, 14.8, 23.3, 38.1, 40.2, 48.7, 48.7 , 54.0 ,60.5 ,64.6 , 85.8, 93.1, 99.5, 115.4, 133.4, 158.8, 206.4, 289.1, 353.7, 428.9, 558.0, 661.8)
which.max(small_vector <= large_vector)
I would like the return to be something like:
[1] 21 21 20 19 18 17 13
CodePudding user response:
sapply(small_vector, function(S) which.max(S <= large_vector))
# [1] 21 21 20 19 18 17 13
CodePudding user response:
You could also use outer with max.col:
max.col(outer(small_vector, large_vector, '<='), 'first')
[1] 21 21 20 19 18 17 13
EDIT:
the solutions above are not feasible when the vectors are large. Hence the need for edit. THE feasible solution will be using the data.table frank as it is faster than rank from base R according to the documentation:
library(data.table)
n <- seq_along(small_vector)
x <- frank(c(small_vector, large_vector), ties.method = 'first')[n]
y <- frank(small_vector, ties.method = 'first') - 1
x - y
[1] 21 21 20 19 18 17 13
benchmarking the above:
f_sapply <- function(){
sapply(small_vector, function(S) which.max(S <= large_vector))
}
f_datatable_rank <- function(){
n <- seq_along(small_vector)
x <- frank(c(small_vector, large_vector), ties.method = 'first')[n]
y <- frank(small_vector, ties.method = 'first') - 1
x - y
}
f_baseR_rank <- function(){
n <- seq_along(small_vector)
x <- rank(c(small_vector, large_vector), ties.method = 'first')[n]
y <- rank(small_vector, ties.method = 'first') - 1
x - y
}
f_outer <- function(){
max.col(outer(small_vector, large_vector, '<='), 'first')
}
a <- microbenchmark(f_sapply(), f_outer(), f_baseR_rank(), f_datatable_rank(),
times = 10,
check = 'equal')
Unit: microseconds
expr min lq mean median uq max neval
f_sapply() 19654.007 21044.502 27836.2294 30436.816 31303.824 32598.046 10
f_outer() 188230.644 194679.142 197723.3888 196696.690 199794.034 210718.518 10
f_baseR_rank() 451.830 469.334 544.6752 540.263 550.291 792.433 10
f_datatable_rank() 1081.983 1335.431 1513.4631 1509.925 1772.672 1834.301 10
If you increase the vectors, the benchmark will fail because of the outer and sapply but rank will work
CodePudding user response:
This should scale well with larger vectors:
(which(order(c(small_vector, cummax(large_vector))) <= length(small_vector)) - 0:(length(small_vector) - 1))[rank(small_vector)]
#> [1] 21 21 20 19 18 17 13
Some benchmarking:
set.seed(3)
library(data.table)
small_vector <- runif(700)
large_vector <- runif(2200)
microbenchmark::microbenchmark(sapply1 = sapply(small_vector, function(S) which.max(S <= large_vector)),
outer = max.col(outer(small_vector, large_vector, '<='), 'first'),
rank = {large_vector <- cummax(large_vector); sapply(small_vector, function(S) rank(c(S, large_vector), ties.method = "first")[1])},
frank = frank(c(small_vector, cummax(large_vector)), ties.method = 'first')[seq_along(small_vector)] - frank(small_vector, ties.method = 'first') 1L,
order = (which(order(c(small_vector, cummax(large_vector))) <= length(small_vector)) - 0:(length(small_vector) - 1))[rank(small_vector)])
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> sapply1 9651.7 10174.70 11380.203 10540.65 11425.35 25423.2 100
#> outer 46033.0 50234.80 55730.100 53168.30 56102.05 96617.0 100
#> rank 201767.7 208193.10 214184.486 214575.85 217909.25 244915.6 100
#> frank 1469.9 1784.35 1982.927 1946.75 2122.35 4260.7 100
#> order 358.7 377.05 423.464 411.00 452.95 668.9 100
A side note: I noticed that the sapply and outer solutions incorrectly return 1 for elements in small_vector that are larger than all elements in large_vector, whereas the rank/order solutions will return 1 length(large_vector) (which is straightforward to interpret correctly). Of course, if it's not a possibility, all approaches in the benchmark work.
