Which is the fastest manner to the derive the conditional minimum value of an R data frame column?-CodePudding

Suppose we have this data frame:

> data
  ID Period_1 Values
1  1  2020-03     -5
2  1  2020-04     25
3  2  2020-01     35
4  2  2020-02     45
5  2  2020-03     55
6  2  2020-04     87
7  3  2020-02     10
8  3  2020-03     20
9  3  2020-04     30

data <- 
  data.frame(
    ID = c(1,1,2,2,2,2,3,3,3),
    Period_1 = c("2020-03", "2020-04", "2020-01", "2020-02", "2020-03", "2020-04", "2020-02", "2020-03", "2020-04"),
    Values = c(-5, 25, 35, 45, 55, 87, 10, 20, 30)
  )

I would like to extract the minimum of "Values", but subject to the condition that a Period_1 condition is met (such as Period_1 == "2020-04"). My inclination is to use dplyr group_by(Period_1) %>% but I don't need the minimum for all Period_1 groupings, I just need the minimum of Values for the single specified period. The actual database I am working with has 2 million rows and I suspect my abundant use of group_by(...) is slowing things down dramatically.

Other Stack Overflow (and Google, etc.) posts I reviewed also rely on group_by, maybe this is the quickest way to process this, I don't know, but I suspect not.

I tried the following but it didn't work: data %>% select(where(data$Period_1 == "2020-04"))%>% min(data$Values, na.rm=TRUE), returning the message "Error: Can't convert a logical vector to function"

Processing speed-wise, which is the fastest way to extract a conditional minimum? Including by use of dplyr.

CodePudding user response：

You are confusing dplyr::filter with dplyr::select. select(where(condition)) selects columns based on a logical condition that is aplied to the whole vector/column, as in select(where(is.numeric)), which selects numeric columns.

To select rows that meet a condition, use filter.

library(dplyr)

data %>%
   filter(Period1 == "2020-04") %>%
   pull(Values) %>%
   min(na.rm = TRUE)

# OR with `summarise`

data %>%
   filter(Period1 == "2020-04") %>%
   summarise(min_Values = min(Values, na.rm = TRUE))

CodePudding user response：

Here is a base R option (if you are looking for speed). We can subset the data, then get the minimum value for the third (i.e., Values) column.

min(data[data$Period_1 == "2020-04", ][,3], na.rm = TRUE)

# [1] 25

Benchmark

CodePudding user response：

Here is another base R function, that follows the lines of Andrew's answer but is faster for small to medium data sets. Then the 2 base R and GuedesBF dplyr solutions are timed and the timings plotted.

data <- 
  data.frame(
    ID = c(1,1,2,2,2,2,3,3,3),
    Period_1 = c("2020-03", "2020-04", "2020-01", "2020-02", "2020-03", "2020-04", "2020-02", "2020-03", "2020-04"),
    Values = c(-5, 25, 35, 45, 55, 87, 10, 20, 30)
  )

library(dplyr)
library(ggplot2)
library(microbenchmark)

f1 <- function(data, period){
  i <- data[["Period_1"]] == period
  min(data$Values[i], na.rm = TRUE)
}
f2 <- function(data, period){
  min(data[data$Period_1 == period, 3], na.rm = TRUE)
}
f3 <- function(data, period){
  data %>%
    filter(Period_1 == period) %>%
    pull(Values) %>%
    min(na.rm = TRUE)
}

funTest <- function(n, X = data, period = "2020-04"){
  out <- lapply(seq.int(n), function(k){
    y <- X
    for(i in seq.int(k)) y <- rbind(y, y)
    mb <- microbenchmark(
      base_Rui = f1(y, period),
      base_Andrew = f2(y, period),
      dplyr_GuedesBF = f3(y, period)
    )
    mb$nrow <- nrow(y)
    aggregate(time ~ expr   nrow, mb, median)
  })
  do.call(rbind, out)
}

timings <- funTest(20)

ggplot(timings, aes(nrow, time, color = expr))  
  geom_line()  
  geom_point()  
  scale_x_continuous(trans = "log10")  
  scale_y_continuous(trans = "log10")  
  theme_bw()

^{Created on 2022-02-03 by the reprex package (v2.0.1)}

CodePudding user response：

Have you considered the grep() function