I am working with a large dataset where much of the data was entered twice. This means that many of the variables are represented by pairs of columns: column.1 with the data entered by one person, and column.2 where the same data was entered by a different person. I want to create a "master" column called simply column that first draws from column.1 and then, if column.1 is NA, draws from column.2.
Here is an example of what I am trying to do with made-up data:
mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
cheese.1 = c(1,4,NA,6,NA),
cheese.2 = c(1,4,3,5,NA),
milk.1 = c(NA,2,0,4,NA),
milk.2 = c(1,2,1,4,2),
tofu.1 = c("yum","yum",NA,"gross", NA),
tofu.2 = c("gross", "yum", "yum", NA, "gross"))
For example, the code below shows an example of what I want to do for a single pair of columns.
mydata %>% mutate(cheese = ifelse(is.na(cheese.1), cheese.2, cheese.1))
#OUTPUT:
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
However, I want to automate the process rather than doing each manually. Below is my attempt at automating the process, using a list (col.list) of the column pairs for which I want to create new "master" columns:
col.list = c("cheese","milk","tofu")
lapply(col.list, FUN = function(x) {
v <- as.name({{x}})
v.1 <- as.name(paste0({{x}}, ".1"))
v.2 <- as.name(paste0(({{x}}), ".2"))
mydata %>% mutate(v = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})
#OUTPUT:
[[1]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
[[2]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 2
3 Carmen NA 3 0 1 <NA> yum 0
4 Dinah 6 5 4 4 gross <NA> 4
5 Billie NA NA NA 2 <NA> gross 2
[[3]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross yum
2 Ella 4 4 2 2 yum yum yum
3 Carmen NA 3 0 1 <NA> yum yum
4 Dinah 6 5 4 4 gross <NA> gross
5 Billie NA NA NA 2 <NA> gross gross
The problems with this attempt are:
- the new columns are not correctly named (they should be named
cheese,milkandtofurather than all be calledv) - the new columns are not added to the original data frame. What I want is for the program to add a series of new "master" columns to my dataframe (one new column for each pair of columns identified in
col.list).
Thanks so much for any help!
CodePudding user response:
(1) You have to wrap v into the curly-curly operator and use :=:
library(dplyr)
col.list <- c("cheese","milk","tofu")
lapply(col.list, FUN = function(x) {
v <- as.name({{x}})
v.1 <- as.name(paste0({{x}}, ".1"))
v.2 <- as.name(paste0(({{x}}), ".2"))
mydata %>% mutate({{ v }} = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})
returns
[[1]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
[...]
which is one step closer to your desired output.
(2) But to get your desired output, I suggest using purrr:
library(purrr)
library(dplyr)
col.list %>%
map(~mydata %>%
select(name, starts_with(.x)) %>%
mutate({{ .x }} := ifelse(
is.na(!!sym(paste0(.x, ".1"))),
!!sym(paste0(.x, ".2")),
!!sym(paste0(.x, ".1"))
)
)
) %>%
reduce(left_join, by = "name")
This returns
name cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2 tofu
1 Sarah 1 1 1 NA 1 1 yum gross yum
2 Ella 4 4 4 2 2 2 yum yum yum
3 Carmen NA 3 3 0 1 0 <NA> yum yum
4 Dinah 6 5 6 4 4 4 gross <NA> gross
5 Billie NA NA NA NA 2 2 <NA> gross gross
CodePudding user response:
Here is one way I would do it. First convert to long format then reshape back to wide format but having only 2 value columns 1 and 2
library(dplyr)
library(tidyr)
mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
cheese.1 = c(1,4,NA,6,NA),
cheese.2 = c(1,4,3,5,NA),
milk.1 = c(NA,2,0,4,NA),
milk.2 = c(1,2,1,4,2),
tofu.1 = c("yum","yum",NA,"gross", NA),
tofu.2 = c("gross", "yum", "yum", NA, "gross"))
mydata_long <- mydata %>%
mutate(across(where(is.numeric), as.character)) %>%
pivot_longer(-name,
names_to = c("food", "nr"),
names_sep = "\\.")
mydata_long
#> # A tibble: 30 x 4
#> name food nr value
#> <chr> <chr> <chr> <chr>
#> 1 Sarah cheese 1 1
#> 2 Sarah cheese 2 1
#> 3 Sarah milk 1 <NA>
#> 4 Sarah milk 2 1
#> 5 Sarah tofu 1 yum
#> 6 Sarah tofu 2 gross
#> 7 Ella cheese 1 4
#> 8 Ella cheese 2 4
#> 9 Ella milk 1 2
#> 10 Ella milk 2 2
#> # ... with 20 more rows
Apply ifelse() function after transforming back to different wide format
mydata_wide <- mydata_long %>%
pivot_wider(names_from = nr,
values_from = value) %>%
mutate(final_val = ifelse(is.na(`1`), `2`, `1`)) %>%
arrange(food)
mydata_wide
#> # A tibble: 15 x 5
#> name food `1` `2` final_val
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Sarah cheese 1 1 1
#> 2 Ella cheese 4 4 4
#> 3 Carmen cheese <NA> 3 3
#> 4 Dinah cheese 6 5 6
#> 5 Billie cheese <NA> <NA> <NA>
#> 6 Sarah milk <NA> 1 1
#> 7 Ella milk 2 2 2
#> 8 Carmen milk 0 1 0
#> 9 Dinah milk 4 4 4
#> 10 Billie milk <NA> 2 2
#> 11 Sarah tofu yum gross yum
#> 12 Ella tofu yum yum yum
#> 13 Carmen tofu <NA> yum yum
#> 14 Dinah tofu gross <NA> gross
#> 15 Billie tofu <NA> gross gross
mydata_wide2 <- mydata_wide %>%
pivot_wider(-c(`1`, `2`),
names_from = food,
values_from = final_val)
mydata_wide2
#> # A tibble: 5 x 4
#> name cheese milk tofu
#> <chr> <chr> <chr> <chr>
#> 1 Sarah 1 1 yum
#> 2 Ella 4 2 yum
#> 3 Carmen 3 0 yum
#> 4 Dinah 6 4 gross
#> 5 Billie <NA> 2 gross
Created on 2021-10-29 by the reprex package (v2.0.1)
CodePudding user response:
I would use purrr::map_dfc and coalesce here. Looks pretty straightforward.
library(purrr)
library(dplyr)
library(stringr)
mydata %>% mutate(map2_dfc(select(., ends_with('1')),
select(., ends_with('2')),
~coalesce(.x, .y)))%>%
select(-ends_with('2'))%>%
rename_with(~str_remove(.x, '\\.\\d $'))
name cheese milk tofu
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
CodePudding user response:
Another tidyverse option. Advantage here is that it keeps the original data type and doesn‘t convert everything to character values.
library(tidyverse)
mydata %>%
pivot_longer(cols = -name,
names_pattern = '(.*)(\\..)',
names_to = c('.value', 'number')) %>%
group_by(name) %>%
mutate(across(-number, ~if_else(is.na(.[1]), .[2], .[1]))) %>%
ungroup() %>%
filter(number == '.1') %>%
select(-number)
Which gives
# A tibble: 5 x 4
name cheese milk tofu
<chr> <dbl> <dbl> <chr>
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
Alternative solution with coalesce:
mydata %>%
pivot_longer(cols = -name,
names_pattern = '(.*)(\\..)',
names_to = c('.value', 'number')) %>%
group_by(name) %>%
mutate(across(-number, ~coalesce(.[1], .[2]))) %>%
ungroup() %>%
filter(number == '.1') %>%
select(-number)
CodePudding user response:
Here is how you can achieve your task:
- define your pairs (in case you have hundreds of columns, this could be automated.
- use
imap_dfcto applycoalescedo the defined pairs - bind to original dataframe
library(dplyr)
library(purrr)
pairs <- list(cheese = c(2, 3), milk = c(4, 5), tofu = c(6, 7))
imap_dfc(pairs, ~mydata[, .x] %>% transmute(!!.y := coalesce(!!!syms(names(mydata)[.x])))) %>%
bind_cols(mydata)
cheese milk tofu name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2
1 1 1 yum Sarah 1 1 NA 1 yum gross
2 4 2 yum Ella 4 4 2 2 yum yum
3 3 0 yum Carmen NA 3 0 1 <NA> yum
4 6 4 gross Dinah 6 5 4 4 gross <NA>
5 NA 2 gross Billie NA NA NA 2 <NA> gross
CodePudding user response:
Here is a pretty simple and dynamic option. Since it uses tidyselect, if there are more than just two columns (eg cheese.1, cheese.2, and cheese.3) this will still work. This will also work if the columns are unbalanced (eg 3 cheese columns, but only 2 milk columns):
library(purrr)
library(stringr)
library(rlang)
str_exp <- map(set_names(col.list), ~ str_glue("exec(coalesce, !!!across(starts_with(\"{.x}\")))"))
express <- map(str_exp, parse_expr)
mydata %>%
mutate(!!! express,
.keep = "unused")
Output
The other columns were removed by .keep = "unused". If you want to keep all the columns then delete that argument.
name cheese milk tofu
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
How it works
- The use of
mapandset_namesis important because this creates a named list, which is important for the big-bang!!!operator later. The firstmapcreates a named list of string representations of expressions. - The use of
execandcoalesceallows the dynamic tidy-selection of columns. - The second
mapturns each of these elements into an expression. As mentioned in the comments by @MartinGal you could combine this into onemapwith a pipe, but I broke it out so it was more readable. - The
!!!operator force-splices the list of objects and the names for the columns are from the list names set up usingmapandset_names.
