Problem
I need to use a dictionary dataset to determine which columns from a different dataset I should calculate the mean.
Data
I will illustrate my case with the iris dataset (a dataset already in R).
I have two datasets:
- The actual data - like
iris(however the col names is basically a1, a2, a3, a4...). - The dictionary for the former indicates what each
iriscolumn indicates. The columnfeaturein thedictionary_irisis the grouping variable, and it would also be part of the name for the new variables (e.g., the new variable would be calledSepal_mean, orPetal_mean).
Iris dataset
library(dplyr)
iris %>% as_tibble()
# # A tibble: 150 x 5
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
# 10 4.9 3.1 1.5 0.1 setosa
# # ... with 140 more rows
Dictionary Dataset
dictionary_iris <- tibble(variables = names(iris)) %>%
separate(variables, into = c("feature", "measure"), remove = FALSE)
dictionary_iris
# # A tibble: 5 x 3
# variables feature measure
# <chr> <chr> <chr>
# 1 Sepal.Length Sepal Length
# 2 Sepal.Width Sepal Width
# 3 Petal.Length Petal Length
# 4 Petal.Width Petal Width
# 5 Species Species NA
Expected Output
I know how to do that manually (see below), but I would like to automatize this process because I have a dataframe with more than 300 columns and want to take 23 different means across those columns.
library(dplyr)
iris %>%
rowwise() %>%
mutate(Sepal_mean = mean(c(Sepal.Length, Sepal.Width), na.rm = TRUE),
Petal_mean = mean(c(Petal.Length, Petal.Width), na.rm = TRUE))
# # A tibble: 150 x 7
# # Rowwise:
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_mean Petal_mean
# <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 4.3 0.8
# 2 4.9 3 1.4 0.2 setosa 3.95 0.8
# 3 4.7 3.2 1.3 0.2 setosa 3.95 0.75
# 4 4.6 3.1 1.5 0.2 setosa 3.85 0.85
# 5 5 3.6 1.4 0.2 setosa 4.3 0.8
# 6 5.4 3.9 1.7 0.4 setosa 4.65 1.05
# 7 4.6 3.4 1.4 0.3 setosa 4 0.85
# 8 5 3.4 1.5 0.2 setosa 4.2 0.85
# 9 4.4 2.9 1.4 0.2 setosa 3.65 0.8
# 10 4.9 3.1 1.5 0.1 setosa 4 0.8
# # ... with 140 more rows
I have the impression that I can do that with dplyr::mutate() and dplyr::across() or with some dplyr::map() function. But I got quite lost.
CodePudding user response:
If the intention is to use 'feature' column as a grouping, then split the 'dictionary_iris' by the 'feature' column (removed the last row (-5) as it is not numeric column, loop over the list with imap, transmute to create the column in 'iris' with the rowMeans of those column name, and bind with the original data
library(dplyr)
library(purrr)
library(stringr)
out <- imap_dfc(split(dictionary_iris$variables[-5],
dictionary_iris$feature[-5]),
~ iris %>%
transmute(!! str_c(.y, "_mean") :=
rowMeans(across(all_of(.x)), na.rm = TRUE))) %>%
bind_cols(iris, .)
-output
> head(out)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal_mean Sepal_mean
1 5.1 3.5 1.4 0.2 setosa 0.80 4.30
2 4.9 3.0 1.4 0.2 setosa 0.80 3.95
3 4.7 3.2 1.3 0.2 setosa 0.75 3.95
4 4.6 3.1 1.5 0.2 setosa 0.85 3.85
5 5.0 3.6 1.4 0.2 setosa 0.80 4.30
6 5.4 3.9 1.7 0.4 setosa 1.05 4.65
CodePudding user response:
iris_means <- iris %>%
mutate(id = row_number()) %>%
pivot_longer(-c(id, Species)) %>%
mutate(name = gsub("\\..*", "", name)) %>%
group_by(id, name) %>%
summarise(val = mean(value)) %>%
ungroup %>%
filter(name %in% !!dictionary_iris$feature) %>%
pivot_wider(-name, values_from = "val", names_glue = "{.name}_mean")
iris %>%
mutate(id = row_number()) %>%
left_join(iris_means) %>%
select(-id)
Joining, by = "id"
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal_mean Sepal_mean
1 5.1 3.5 1.4 0.2 setosa 0.80 4.30
2 4.9 3.0 1.4 0.2 setosa 0.80 3.95
3 4.7 3.2 1.3 0.2 setosa 0.75 3.95
4 4.6 3.1 1.5 0.2 setosa 0.85 3.85
5 5.0 3.6 1.4 0.2 setosa 0.80 4.30
6 5.4 3.9 1.7 0.4 setosa 1.05 4.65
