I am creating a function to automate importing several data sets, and am having difficulty with variable value assignment in the pipeline.
The Task:
I have 20 data sets that need to be imported and have the same variables calc'ed and created. The input data sets contain the same fields, except for one field whose name can be one of three possible values ('varYEAR' where 'YEAR' is a one of three possible years), but whose underlying data is of the same type. I am attempting to assign the values of this varYEAR vector to a uniformly-named variable 'varXXXX' so that further mutations may reference this latter variable and not have to worry about the original variable name. The code below has been simplified to focus on the crux of the issue.
Current Example Code/Output:
set_import <- function(input_path,year) {
temp_set <- read_table(input_path) %>%
mutate(MSAXXXX = ifelse(exists('var2003'),var2003,var2013))
}
View(temp_set)
var2003 varXXXX
1 41929 41929
2 33820 41929
3 27642 41929
4 88111 41929
I have tried a few different approaches to this, including:
mutate(varXXXX = ifelse('var2003' %in% names(.),var2003,var2013)) as well as trying it outside of the pipeline with temp_set$MSAXXXX = as.character(ifelse('var2003' %in% names(temp_set),temp_set$var2003,var2013)). Each has resulted in the same output set above.
Desired Output:
View(temp_set)
var2003 varXXXX
1 41929 41929
2 33820 33820
3 27642 27642
4 88111 88111
The best I can surmise right now is that the 'if' statement is predicated on a single check, and for some reason it populates the varXXXX field with the first value it finds in said check. I'm at a loss, and would really appreciate some help! Thanks in advance.
CodePudding user response:
Solution
Simply use regex to change the column names:
temp_set <- read_table(input_path)
names(temp_set) <- gsub(x = names(temp_set), pattern = "^(. )(\\d{4,4})$", replacement = "\\1XXXX")
Or equivalently with `names<-`() in the dplyr workflow:
temp_set <- read_table(input_path) %>%
`names<-`(gsub(x = names(.), pattern = "^(. )(\\d{4,4})$", replacement = "\\1XXXX"))
Regex
The pattern = "^(. )(\\d{4,4})$" breaks each name into two capturing groups:
- Any prefix of positive length:
. - Some year comprised of
4digits:\\d{4,4}
The replacement = "\\1XXXX" then prepends the first group (\\1) to the code (XXXX); so the code essentially "replaces" the year.
Example
Here are two possible cases, where the MSAXXXX column starts as MSA2003 and as MSA2013 respectively:
case_1 <- data.frame(
MSA2003 = c(41929, 33820, 27642, 88111),
var2019 = c(41929, 33820, 27642, 88111),
other_var = 1:4
)
case_1
#> MSA2003 var2019 other_var
#> 1 41929 41929 1
#> 2 33820 33820 2
#> 3 27642 27642 3
#> 4 88111 88111 4
case_2 <- data.frame(
MSA2013 = c(41929, 33820, 27642, 88111),
var2009 = c(41929, 33820, 27642, 88111),
other_var = 1:4
)
case_2
#> MSA2013 var2009 other_var
#> 1 41929 41929 1
#> 2 33820 33820 2
#> 3 27642 27642 3
#> 4 88111 88111 4
Notice how the solution standardizes all variables with years in their names, yet leaves the other variables untouched:
library(dplyr)
case_1 %>%
`names<-`(gsub(x = names(.), pattern = "^(. )(\\d{4,4})$", replacement = "\\1XXXX"))
#> MSAXXXX varXXXX other_var
#> 1 41929 41929 1
#> 2 33820 33820 2
#> 3 27642 27642 3
#> 4 88111 88111 4
case_2 %>%
`names<-`(gsub(x = names(.), pattern = "^(. )(\\d{4,4})$", replacement = "\\1XXXX"))
#> MSAXXXX varXXXX other_var
#> 1 41929 41929 1
#> 2 33820 33820 2
#> 3 27642 27642 3
#> 4 88111 88111 4
