Iterating name of a field with dplyr::summarise function-CodePudding

first time for me here, I'll try to explain you my problem as clearly as possible. I'm working on erosion data contained in farms in the form of pixels (e.g. 1 farm = 10 pixels so 10 lines in my df), for this I have 4 df in a list, and I would like to calculate for each farm the mean of erosion. I thought about a loop on the name of erosion field but my problem is that my df don't have the exact name (either ERO13 or ERO17). I don't want to work the position of the field because it could change between the df, only with the name which is variable.

Here's a example :

df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
  cur_df <- df
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}

I tried with

for (df in lst_df){
  cur_df <- df
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(cur_camp = mean(cur_camp))
}

but first doesn't work because it's a string character and not a variable containing the string character and it works with the position.

How can I build the current_name_of_erosion_field here ?

CodePudding user response：

We may convert it to symbol and evaluate (!!) or may pass the string across. Also, as we are using a for loop, make sure to create a list to store the output. Also, to assign from an object created, use := with !!

out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(!!cur_camp := mean(!! sym(cur_camp)))
  out[[i]] <- cur_df
}

-output

> out
[[1]]
# A tibble: 2 × 2
     ID ERO13
  <dbl> <dbl>
1     1     3
2     2     6

[[2]]
# A tibble: 2 × 2
     ID ERO17
  <dbl> <dbl>
1     4   4.5
2     6  12

Or may use across

out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(across(all_of(cur_camp), mean))
  out[[i]] <- cur_df
}

-output

> out
[[1]]
# A tibble: 2 × 2
     ID ERO13
  <dbl> <dbl>
1     1     3
2     2     6

[[2]]
# A tibble: 2 × 2
     ID ERO17
  <dbl> <dbl>
1     4   4.5
2     6  12

CodePudding user response：

A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.

library(tidyverse)

df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))

bind_rows(df1, df2) %>%
  pivot_longer(starts_with('ERO'), 
               names_to = 'ERO',
               values_drop_na = TRUE) %>%
  group_by(ID, ERO) %>%
  summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups:   ID [4]
#>      ID ERO   value
#>   <dbl> <chr> <dbl>
#> 1     1 ERO13   3  
#> 2     2 ERO13   6  
#> 3     4 ERO17   4.5
#> 4     6 ERO17  12

^{Created on 2022-01-14 by the reprex package (v2.0.0)}