Error on final object when generating ggplot objects in for loop with dplyr select()

I want to make many plots using multiple pairs of variables in a dataframe, all with the same x. I store the plots in a named list. For simplicity, below is an example with only 1 variable in each plot.

Key to this function is a select() call that is clearly not necessary here but is with my actual data.

The body of the function works fine on each variable, but when I loop through a list of variables, the last one in the list always produces

Error in get(ll): object 'd' not found.

(or whatever the last variable is, if not 'd'). Replacing data <- df %>% select(x,ll) with data <- df avoids the error.

## load packages
library(dplyr)
library(ggplot2)

## make data
df2 <- data.frame(x = 1:10,
                  a = 1:10,
                  b = 2:11,
                  c = 101:110,
                  d = 10*(1:10))

## make function
testfun <- function(df = df2, vars = letters[1:4]){
  ## initialize list to store plots
  plotlist <- list()
  
  for (ll in vars){
    ## subset data
    data <- df %>% select(x, ll) ## comment out select() to get working function
    # print(data) ## uncomment to check that dataframe subset works correctly
    
    ## plot variable vs. x
    p <- ggplot(data,
           aes(x = x, y = get(ll))) +
      geom_point() +
      ylab(ll)
    
    ## add plot to named list
    plotlist[[ll]] <- p
    # print(p) ## uncomment to see that each plot is being made
  }
  return(plotlist) ## unnecessary, being explicit for troubleshooting
}

## use function
pl <- testfun(df2)
## error ?
pl

I have a work-around that avoids select() by renaming variables in my actual dataframe, but I am curious why this does not work? Any ideas?

CodePudding user response：

The issue is that get() is not a reliable way to access columns when programming with dplyr/tidyverse functions. Instead, we should use the non-standard evaluation tools the tidyverse provides, here the .data pronoun, to access the data. I offer a simplified function below (originally I thought it was a function-masking issue, as I had only skimmed the question).

testfun <- function(df = df2, vars = letters[1:4]){
 
  
  lapply(vars, function(y) {
    ggplot(df,
           aes(x = x, y = .data[[y]])) +
      geom_point() +
      ylab(y)
    
  })


}

Calling

plots <- testfun(df2)
plots[[1]]

EDIT

Since OP would like to know what the issue is, I have used a traditional loop as requested:

testfun2 <- function(df = df2, vars = letters[1:4]){
  ## initialize list to store plots
  plotlist <- list()
  
  for (ll in vars){
    ## subset data
    d_t <- df %>% select(x, ll)
    # print(d_t) ## uncomment to check that dataframe subset works correctly
    ## plot variable vs. x
    p <- ggplot(d_t,
                aes(x = x, y = .data[[ll]])) +
      geom_point() +
      ylab(ll)
    ## add plot to named list
    plotlist[[ll]] <- p
    # print(p) ## uncomment to see that each plot is being made
  }
  plotlist

}
pl <- testfun2(df2)
pl[[1]]

The reason get does not work is that we need to use non-standard evaluation, as the docs explain.

CodePudding user response：

Write a custom function like this:

plot_fn <- function(df, y){
  df %>% ggplot(aes(x = x,
                    y = get(y))) +
    geom_point() +
    ylab(y)
}

Iterate over the variables with purrr::map:

map(letters[1:4],~plot_fn(df=df2,y=.x))

CodePudding user response：

By the time each ggplot evaluates get(ll), the loop has already finished and ll evaluates to "d" for all four ggplots. ll being "d" in the error suggests that it's the final ggplot object that fails, but it's actually the first that causes this error.

get() could work, but not with ll because by the time ll is evaluated it'll always be "d". In our loop we'd like a way to evaluate the ll variable now, and stick that resulting string ("a", "b", "c", or "d") into this code, the rest of which won't run until later. Changing y = get(ll) to y = get(!!ll) is one way to do this: !! performs "surgery" on the unevaluated expression (called a "blueprint for code" in Tidyverse docs) so that the expression passed into ggplot is a string like "a" instead of the variable ll.

testfun <- function(df = df2, vars = letters[1:4]){
  plotlist <- list()
  
  for (ll in vars){
    data <- df %>% select(x, ll)
    
    p <- ggplot(data,
                aes(x = x, y = get(!!ll))) +
      geom_point() +
      ylab(ll)
    
    plotlist[[ll]] <- p
  }
  return(plotlist) ## unnecessary, being explicit for troubleshooting
}

Read on for explanation and an alternate solution.

The loop problem: late binding

Variable scope in R is based on the function it's in: you can have multiple variables with the same name, but only one per function. Here's a way this can trip you up:

vars <- c("a", "b", "c", "d")

(function() {
  results <- list()

  for (ll in vars){
    func <- function () { ll }
    results[[ll]] <- c(ll, func)
  }

  for (vec in results) {
    print(c(vec[[1]], vec[[2]]()))
  }
})()

This outputs

[1] "a" "d"
[1] "b" "d"
[1] "c" "d"
[1] "d" "d"

Each of the four inner functions constructed here uses the same variable ll from the enclosing scope (the anonymous wrapper function, in this case), which is "d" by the time the functions are actually called after the for loop. The "late binding" part is that the value of the variable at function call time is used, not the value it had when the function was defined.
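To make the shared-scope point concrete, here is a minimal base R sketch. The names make_getters and getters are made up for illustration and are not part of the OP's code. It checks that every closure created in the loop points at the same environment, which is why they all see the same final value of ll.

```r
# Hypothetical helper: builds one closure per name inside a for loop.
make_getters <- function(vars) {
  getters <- list()
  for (ll in vars) {
    # The closure remembers the environment it was created in,
    # not the value ll had at creation time.
    getters[[ll]] <- function() ll
  }
  getters
}

getters <- make_getters(c("a", "b", "c", "d"))

# Every closure was created in the same call to make_getters(),
# so they all share one environment and one ll.
identical(environment(getters[["a"]]), environment(getters[["d"]]))  # TRUE
getters[["a"]]()  # "d"
```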

The NSE problem

The OP isn't creating functions though, they're calling ggplot. ggplot does something similar to creating a function: it takes some code that it doesn't evaluate until later. ggplot "captures" code instead of running it. In OP's case, get(ll) isn't evaluated until later.

When this code is evaluated, it's in a new context with a "data mask" that allows the columns of a data frame to be referenced directly by name. This part is great, it's what we want: it is what makes get("a") work at all. But evaluating the code later is a problem for the OP: if the code is evaluated after the for loop iteration where ll had the expected value, ll in get(ll) evaluates to "d", so the call behaves like get("d").

Ignoring the data mask part, here's a function called run.later that, like ggplot, doesn't run one of its arguments. When we run that code later, we again find that ll is "d" for all four of the saved expressions.

vars <- c("a", "b", "c", "d")

unevaluated.exprs <- list()

run.later <- function(name, something) {
  expr <- substitute(something)
  unevaluated.exprs[[name]] <<- c(name, expr)
}

(function() {
  for (ll in vars){
    run.later(ll, ll)
  }

  for (vec in unevaluated.exprs) {
    print(c(vec[[1]], eval(vec[[2]])))
  }
})()

This prints

[1] "a" "d"
[1] "b" "d"
[1] "c" "d"
[1] "d" "d"

That's the ll part of the problem. A rule of thumb from languages like Python of "Don't define functions in a loop (if they reference loop variables)" could be generalized for R to "don't define functions, formulas, or otherwise write code that won't be immediately evaluated in a loop (if they reference loop variables)."

Fixing the scope problem instead of metaprogramming

The !! solution provided at the top uses metaprogramming to evaluate the ll variable in the loop instead of evaluating it later.
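For readers who want to see the same idea without ggplot or rlang, here is a rough base R analogue using bquote(), which substitutes the current value of anything wrapped in .() into an unevaluated expression. This is only a sketch of the concept, not how ggplot handles !! internally; the names exprs and toy are made up for illustration.

```r
vars <- c("a", "b", "c", "d")
exprs <- list()

for (ll in vars) {
  # .(ll) is replaced by the value ll has right now,
  # so the stored call is get("a"), get("b"), ... rather than get(ll).
  exprs[[ll]] <- bquote(get(.(ll)))
}

# Changing ll afterwards no longer matters: the string is already in the call.
ll <- "something else"
deparse(exprs[["a"]])  # "get(\"a\")"

# Evaluate the stored call later, with a data frame standing in for the data mask.
toy <- data.frame(a = 1:3, d = 4:6)  # hypothetical toy data
eval(exprs[["a"]], toy)  # 1 2 3
```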

Theoretically, one could dynamically create variables in each iteration of a loop, then carefully reference that dynamically created variable name with metaprogramming. But a more elegant way would be to use the same variable name but in different scopes. This is what Nithin's answer does with a function: every function creates a new scope, tada, you can use the same variable name in each. Here's another version of that, closer to OP's code:

testfun <- function(df = df2, vars = letters[1:4]){
  plotlist <- list()

  plot.fn <- function(var) {
      data <- df %>% select(x, var)
      p <- ggplot(data,
          aes(x = x, y = get(var))) +
          geom_point() +
          ylab(var)
      ## use var, not the loop variable ll, so plot.fn does not depend on the loop
      plotlist[[var]] <<- p
  }
  
  for (ll in vars){
    plot.fn(ll)
  }
  return(plotlist)
}

pl <- testfun(df2)
pl

There are 4 distinct variables called var, and in each iteration of the loop we reference a different one.
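The "one scope per call" idea can be checked in base R without any plotting. The sketch below uses made-up names (make_getter, scoped_getters) and assumes nothing beyond base R: each call to the factory function gets its own environment, so each closure keeps its own var.

```r
# Hypothetical factory: every call creates a fresh environment holding its own var.
make_getter <- function(var) {
  force(var)  # evaluate the argument now, inside this call
  function() var
}

scoped_getters <- list()
for (ll in c("a", "b", "c", "d")) {
  scoped_getters[[ll]] <- make_getter(ll)
}

scoped_getters[["a"]]()  # "a"
scoped_getters[["d"]]()  # "d"

# Unlike closures defined directly in the loop, these do not share an environment.
identical(environment(scoped_getters[["a"]]),
          environment(scoped_getters[["d"]]))  # FALSE
```

The force() call matters because R evaluates function arguments lazily; without it, var would not be looked up until the closure is first called, by which time ll may have changed.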