Home > Back-end >  How to efficiently apply multiple functions simultaneously to the same dataframe and save the result
How to efficiently apply multiple functions simultaneously to the same dataframe and save the result

Time:01-11

I want to apply several different functions simultaneously to one dataframe, then put the results into a list of dataframes. So, for example, I could arrange by one column, then save the output as a new dataframe. Or I could filter some data, then save as another new dataframe (and so on). I feel like there must be an easy way to do this with purrr or apply, but am unsure how to proceed. So, I'm wondering if there is a way to give a list of functions, then return a list of dataframes. Here are some example functions that I apply to mtcars:

library(tidyverse)

filter_df <- function(x, word) {
  x %>% 
    tibble::rownames_to_column("ID") %>% 
    filter(str_detect(ID, word))
}
a <- filter_df(mtcars, "Merc")


mean_n_df <- function(x, grp, mean2) {
  x %>%
    group_by({{grp}}) %>%
    summarise(mean = mean({{mean2}}), n = n())
}
b <- mean_n_df(mtcars, grp = cyl, mean2 = wt)



rating <- function(x, a, b, c) {
  x %>% 
    rowwise %>% 
    mutate(rating = ({{a}}*2)   ({{b}}-5) * abs({{c}} - 30))
  
}
c <- rating(mtcars, a = cyl, b = drat, c = qsec)



pct <- function(data, var, round = 4){
  var_expr <- rlang::enquo(var)
  colnm_expr <- paste(rlang::get_expr(var_expr), "pct", sep = "_")
  
  data %>%
    mutate(!! colnm_expr := !!var_expr/sum(!!var_expr) %>%
             round(round))
}
d <- pct(mtcars, mpg)

I know that I could run the code above, then just bind each dataframe into a list.

df_list <- list(mtcars, a, b, c, d)

str(df_list, 1)[[1]]

List of 5
 $ :'data.frame':   32 obs. of  11 variables:
 $ :'data.frame':   7 obs. of  12 variables:
 $ : tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
 $ : rowwise_df [32 × 12] (S3: rowwise_df/tbl_df/tbl/data.frame)
  ..- attr(*, "groups")= tibble [32 × 1] (S3: tbl_df/tbl/data.frame)
 $ :'data.frame':   32 obs. of  12 variables:

CodePudding user response:

This seems a bit bespoke (since each function requires different parameters), but I'd use Map (or purrr::map2 or purrr::pmap), passing a function and the args for it:

filter_df <- function(x, word) {
  x %>% 
    tibble::rownames_to_column("ID") %>% 
    filter(str_detect(ID, word))
}
mean_n_df <- function(x, grp, mean2) {
  x %>%
    group_by({{grp}}) %>%
    summarise(mean = mean({{mean2}}), n = n())
}
rating <- function(x, a, b, c) {
  x %>% 
    rowwise %>% 
    mutate(rating = ({{a}}*2)   ({{b}}-5) * abs({{c}} - 30))
}
pct <- function(data, var, round = 4){
  var_expr <- rlang::enquo(var)
  colnm_expr <- paste(rlang::get_expr(var_expr), "pct", sep = "_")
  data %>%
    mutate(!! colnm_expr := !!var_expr/sum(!!var_expr) %>%
             round(round))
}

The call:

out <- Map(
  function(fun, args) do.call(fun, c(list(mtcars), args)),
  list(filter_df, mean_n_df, rating, pct),
  list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
       list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
       list(quo(mpg)))
)

lapply(out, head, 3)
# [[1]]
#          ID  mpg cyl  disp  hp drat   wt qsec vs am gear carb
# 1 Merc 240D 24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
# 2  Merc 230 22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
# 3  Merc 280 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
# [[2]]
# # A tibble: 3 x 3
#     cyl  mean     n
#   <dbl> <dbl> <int>
# 1     4  2.29    11
# 2     6  3.12     7
# 3     8  4.00    14
# [[3]]
# # A tibble: 3 x 12
# # Rowwise: 
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb rating
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
# 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4  -2.89
# 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4  -2.28
# 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1  -5.10
# [[4]]
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb    mpg_pct
# Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.03266449
# Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.03266449
# Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.03546430

A few things:

  • Because you demonstrated using the unevaluated symbols (grp=cyl), we have to quote them first, otherwise they would be evaluated before reaching the functions.

  • You can general this out to arbitrary data by not hard-coding it in the Map anon-func, with:

    out <- Map(
      function(x, fun, args) do.call(fun, c(list(x), args)),
      list(mtcars),
      list(filter_df, mean_n_df, rating, pct),
      list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
           list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
           list(quo(mpg)))
    )
    

    where the list(.) around mtcars is intentional: it appears as length-1 to Map, so it is recycled for the other args (length 4 each). Without list, Map would fail because the first function would see the first column (as a vector), second function second column (and/or warning with longer argument not a multiple of length of shorter ... I really wish mis-aligned recycling in R would fail harder than that).

    This generalization allows this sequence of functions to be applied each to multiple datasets:

    out2 <- lapply(list(mtcars[1:10,], mtcars[11:32,]), function(XYZ) {
      Map(
        function(x, fun, args) do.call(fun, c(list(x), args)),
        list(XYZ),
        list(filter_df, mean_n_df, rating, pct),
        list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
             list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
             list(quo(mpg)))
      )
    })
    

    Not sure if you're intending the inception of applying a list of functions to a list of datasets ...

CodePudding user response:

Using invoke with map2 from purrr

library(purrr)
df_list2 <- c(list(mtcars), map2(list(filter_df, mean_n_df, rating, pct), 
   list("Merc", expression(grp = cyl, mean2 = wt), 
       expression(a = cyl, b= drat, c = qsec), quote(mpg)), 
     ~ invoke(.x, c(list(mtcars), as.list(.y)))))

-checking

all.equal(df_list2, df_list, check.attributes = FALSE)
[1] TRUE
  •  Tags:  
  • Related