I am creating a function to automate importing several data sets, and am having difficulty with variable value assignment in the pipeline.

The Task:

I have 20 data sets that need to be imported and have the same variables calc'ed and created. The input data sets contain the same fields, except for one field whose name can be one of three possible values ('varYEAR' where 'YEAR' is a one of three possible years), but whose underlying data is of the same type. I am attempting to assign the values of this varYEAR vector to a uniformly-named variable 'varXXXX' so that further mutations may reference this latter variable and not have to worry about the original variable name. The code below has been simplified to focus on the crux of the issue.

Current Example Code/Output:

set_import <- function(input_path,year) {
   temp_set <- read_table(input_path) %>%
     mutate(MSAXXXX = ifelse(exists('var2003'),var2003,var2013))
}

View(temp_set)

       var2003        varXXXX
 1     41929          41929
 2     33820          41929
 3     27642          41929
 4     88111          41929

I have tried a few different approaches to this, including: mutate(varXXXX = ifelse('var2003' %in% names(.),var2003,var2013)) as well as trying it outside of the pipeline with temp_set$MSAXXXX = as.character(ifelse('var2003' %in% names(temp_set),temp_set$var2003,var2013)). Each has resulted in the same output set above.

Desired Output:

View(temp_set)

       var2003        varXXXX
 1     41929          41929
 2     33820          33820          
 3     27642          27642          
 4     88111          88111

The best I can surmise right now is that the 'if' statement is predicated on a single check, and for some reason it populates the varXXXX field with the first value it finds in said check. I'm at a loss, and would really appreciate some help! Thanks in advance.

CodePudding user response：

Solution

Simply use regex to change the column names:

temp_set <- read_table(input_path)

names(temp_set) <- gsub(x = names(temp_set), pattern = "^(. )(\\d{4,4})$", replacement = "\\1XXXX")

Or equivalently with `names<-`() in the dplyr workflow:

temp_set <- read_table(input_path) %>%
  `names<-`(gsub(x = names(.), pattern = "^(. )(\\d{4,4})$", replacement = "\\1XXXX"))

Regex

The pattern = "^(. )(\\d{4,4})$" breaks each name into two capturing groups:

Any prefix of positive length: .
Some year comprised of 4 digits: \\d{4,4}

The replacement = "\\1XXXX" then prepends the first group (\\1) to the code (XXXX); so the code essentially "replaces" the year.

Example

Here are two possible cases, where the MSAXXXX column starts as MSA2003 and as MSA2013 respectively:

case_1 <- data.frame(
  MSA2003 = c(41929, 33820, 27642, 88111),
  var2019 = c(41929, 33820, 27642, 88111),
  other_var = 1:4
)
case_1
#>   MSA2003 var2019 other_var
#> 1   41929   41929         1
#> 2   33820   33820         2
#> 3   27642   27642         3
#> 4   88111   88111         4

case_2 <- data.frame(
  MSA2013 = c(41929, 33820, 27642, 88111),
  var2009 = c(41929, 33820, 27642, 88111),
  other_var = 1:4
)
case_2
#>   MSA2013 var2009 other_var
#> 1   41929   41929         1
#> 2   33820   33820         2
#> 3   27642   27642         3
#> 4   88111   88111         4

Notice how the solution standardizes all variables with years in their names, yet leaves the other variables untouched:

library(dplyr)

case_1 %>%
  `names<-`(gsub(x = names(.), pattern = "^(. )(\\d{4,4})$", replacement = "\\1XXXX"))
#>   MSAXXXX varXXXX other_var
#> 1   41929   41929         1
#> 2   33820   33820         2
#> 3   27642   27642         3
#> 4   88111   88111         4


case_2 %>%
  `names<-`(gsub(x = names(.), pattern = "^(. )(\\d{4,4})$", replacement = "\\1XXXX"))
#>   MSAXXXX varXXXX other_var
#> 1   41929   41929         1
#> 2   33820   33820         2
#> 3   27642   27642         3
#> 4   88111   88111         4