Home > Blockchain >  Iterating name of a field with dplyr::summarise function
Iterating name of a field with dplyr::summarise function

Time:01-15

first time for me here, I'll try to explain you my problem as clearly as possible. I'm working on erosion data contained in farms in the form of pixels (e.g. 1 farm = 10 pixels so 10 lines in my df), for this I have 4 df in a list, and I would like to calculate for each farm the mean of erosion. I thought about a loop on the name of erosion field but my problem is that my df don't have the exact name (either ERO13 or ERO17). I don't want to work the position of the field because it could change between the df, only with the name which is variable.

Here's a example :

df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
  cur_df <- df
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}

I tried with

for (df in lst_df){
  cur_df <- df
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(cur_camp = mean(cur_camp))
}

but first doesn't work because it's a string character and not a variable containing the string character and it works with the position.

How can I build the current_name_of_erosion_field here ?

CodePudding user response:

We may convert it to symbol and evaluate (!!) or may pass the string across. Also, as we are using a for loop, make sure to create a list to store the output. Also, to assign from an object created, use := with !!

out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(!!cur_camp := mean(!! sym(cur_camp)))
  out[[i]] <- cur_df
}

-output

> out
[[1]]
# A tibble: 2 × 2
     ID ERO13
  <dbl> <dbl>
1     1     3
2     2     6

[[2]]
# A tibble: 2 × 2
     ID ERO17
  <dbl> <dbl>
1     4   4.5
2     6  12  

Or may use across

out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(across(all_of(cur_camp), mean))
  out[[i]] <- cur_df
}

-output

> out
[[1]]
# A tibble: 2 × 2
     ID ERO13
  <dbl> <dbl>
1     1     3
2     2     6

[[2]]
# A tibble: 2 × 2
     ID ERO17
  <dbl> <dbl>
1     4   4.5
2     6  12  

CodePudding user response:

A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.

library(tidyverse)

df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))

bind_rows(df1, df2) %>%
  pivot_longer(starts_with('ERO'), 
               names_to = 'ERO',
               values_drop_na = TRUE) %>%
  group_by(ID, ERO) %>%
  summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups:   ID [4]
#>      ID ERO   value
#>   <dbl> <chr> <dbl>
#> 1     1 ERO13   3  
#> 2     2 ERO13   6  
#> 3     4 ERO17   4.5
#> 4     6 ERO17  12

Created on 2022-01-14 by the reprex package (v2.0.0)

  •  Tags:  
  • Related