I have a huge dataset and wish to replace values in certain columns (VAR1, VAR2) with NA if they do not start with AA or DD.
Data:
DF<-tibble::tribble(
~ID, ~VAR1, ~VAR1DATE, ~VAR2, ~VAR2DATE,
1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)
Desired output:
A tibble: 5 × 5
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
Is there an elegant and smart way to this? mutate_all?
CodePudding user response:
We may do this in two steps - loop across the columns that have 'VAR' followed by digits (\\d ) in column names, replace the values where the first two characters are not AA or DD to NA, then replace the corresponding DATE column to NA based on the NA in the 'VAR1', 'VAR2' columns
library(dplyr)
library(stringr)
DF %>%
mutate(across(matches("^VAR\\d $"),
~ replace(., !substr(., 1, 2) %in% c("AA", "DD"), NA)),
across(ends_with("DATE"),
~ replace(., is.na(get(str_remove(cur_column(), "DATE"))), NA)))
-output
# A tibble: 5 × 5
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 <NA> <NA>
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 <NA> <NA> DDCC 2001-01-03
4 4 DDAA 2001-01-04 <NA> <NA>
5 5 <NA> <NA> <NA> <NA>
CodePudding user response:
Use mutate/across with the assignment function is.na<-.
DF %>%
mutate(across(starts_with("VAR"), \(x) `is.na<-`(x, !grepl("^AA|^DD", x))))
## A tibble: 5 x 5
# ID VAR1 VAR1DATE VAR2 VAR2DATE
# <int> <chr> <chr> <chr> <chr>
#1 1 AABB NA NA NA
#2 2 AACC NA AACC NA
#3 3 NA NA DDCC NA
#4 4 DDAA NA NA NA
#5 5 NA NA NA NA
Or simpler:
DF %>%
mutate(across(starts_with("VAR"), ~`is.na<-`(., !grepl("^AA|^DD", .))))
CodePudding user response:
Here is a tidyverse solution. Using across with str_replace_all and appending two ifelse statements.
library(dplyr)
library(stringr)
DF %>%
mutate(across(c(VAR1, VAR2), ~str_replace_all(., "^[^AA|DD]", NA_character_))) %>%
mutate(VAR1DATE = ifelse(is.na(VAR1), NA_character_, VAR1DATE),
VAR2DATE = ifelse(is.na(VAR2), NA_character_, VAR2DATE))
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
CodePudding user response:
Another possibility, using tidyr::unite and tiydr::separate:
library(tidyverse)
DF<-tibble::tribble(
~ID, ~VAR1, ~VAR1DATE, ~VAR2, ~VAR2DATE,
1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)
DF %>%
unite(VAR1,VAR1,VAR1DATE) %>% unite(VAR2,VAR2,VAR2DATE) %>%
mutate(across(starts_with("VAR"),~if_else(str_detect(.x, "^AA|^DD"), .x, NA_character_))) %>%
separate(VAR1,into = c("VAR1", "VAR1DATE"), sep = "_") %>%
separate(VAR2,into = c("VAR2", "VAR2DATE"), sep = "_")
#> # A tibble: 5 × 5
#> ID VAR1 VAR1DATE VAR2 VAR2DATE
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 AABB 2001-01-01 <NA> <NA>
#> 2 2 AACC 2001-01-02 AACC 2001-01-02
#> 3 3 <NA> <NA> DDCC 2001-01-03
#> 4 4 DDAA 2001-01-04 <NA> <NA>
#> 5 5 <NA> <NA> <NA> <NA>
CodePudding user response:
Here is another tidyverse solution with using str_detect to determine where to convert to NA for the date columns. Then, we can use the same type of function on VAR1 and VAR2.
library(tidyverse)
DF %>%
rowwise %>%
mutate(
VAR1DATE = ifelse(str_detect(VAR1, '^BB|^CC') == TRUE, NA, VAR1DATE),
VAR2DATE = ifelse(str_detect(VAR2, '^BB|^CC') == TRUE, NA, VAR2DATE)
) %>%
mutate(across(c(VAR1, VAR2), function(x)
ifelse(str_detect(x, '^BB|^CC') == TRUE, NA, x)))
Output
# A tibble: 5 × 5
# Rowwise:
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
