replacing values of a string variable in stringr

Time:01-15

Many values in my data frame are written differently even though they refer to the same thing, so I need to recode some of the column values to make them consistent. I used str_replace_all from the stringr package, but it does not do what I want. Here is my reproducible data and code.

df <- data.frame(
  stringsAsFactors = FALSE,
  Var1 = c("16-pathway", "16a-OH E1",
           "16a-OHE1", "16OHE", "17-b-estradiol", "17-OH-progesterone",
           "17-OH-progesterone/ androstenedione ratio",
           "17b-HSD (rs2830A)", "17b-HSD (rs592389 G)", "17b-HSD (rs615492 G)",
           "17b-HSD (rs615942 G)", "17b estradiol",
           "17OH-progesterone", "2-hydroxy (OH) E1", "2-OHE-1", "2-OHE-2",
           "2-pathway", "2:16 OHE ratio", "2:16 pathway ratio", "2:16a-OH E1",
           "2:16OHE", "2OHE", "Adiponectin", "androstenedione",
           "Androstenedione", "androstenedione  (A)"),
  Freq = c(2L, 1L, 4L, 8L, 1L, 6L, 6L, 2L,
           2L, 1L, 1L, 1L, 5L, 1L, 4L, 4L, 2L, 4L, 2L, 1L, 8L, 8L,
           8L, 1L, 62L, 1L)
)

library(stringr)
df$new_var1 <- str_replace_all(df$Var1,
                                  c(#16OHE1
                                    "16a-OH E1" = "16-OHE1", 
                                    "16a-OHE1" = "16-OHE1", 
                                    "16OHE" = "16-OHE1",
                                    
                                    #17Beta estradiol
                                    "17-b-estradiol" = "17-b-estradiol",
                                    "17b estradiol"= "17-b-estradiol",
                                    #Androstenedione

                                    "androstenedione" = "Androstenedione",
                                    "Androstenedione" = "Androstenedione",
                                    "androstenedione  (A)" = "Androstenedione",

                                    #2-OHE-1
                                    "2-OHE-1" = "2-OHE-1",
                                    "2-hydroxy (OH) E1" = "2-OHE-1")
)

Now, if you compare Var1 and new_var1, the replacement did not change "2-hydroxy (OH) E1" to "2-OHE-1" or "androstenedione  (A)" to "Androstenedione".

CodePudding user response:

In str_replace_all the patterns are treated as regular expressions, so you need to escape the ( and ) by putting a double backslash (\\) in front of each. Try the code below. :)

df$new_var1 <- str_replace_all(df$Var1,
                               c(#16OHE1
                                 "16a-OH E1" = "16-OHE1", 
                                 "16a-OHE1" = "16-OHE1", 
                                 "16OHE" = "16-OHE1",
                                 "17-b-estradiol" = "17-b-estradiol",
                                 "17b estradiol"= "17-b-estradiol",
                                 "androstenedione" = "Androstenedione",
                                 "Androstenedione" = "Androstenedione",
                                 "androstenedione  \\(A\\)" = "Androstenedione",
                                 "2-OHE-1" = "2-OHE-1",
                                 "2-hydroxy \\(OH\\) E1" = "2-OHE-1"))

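As a small illustration of the escaping point above, the sketch below runs one value from the question through three variants of the pattern. The variable names are only for illustration, and the `fixed()` variant is an alternative not used in the answer: it tells stringr to match the pattern literally, so no escaping is needed.

```r
library(stringr)

x <- "2-hydroxy (OH) E1"

# Unescaped: "(OH)" is read as a regex group, so the pattern looks for
# "2-hydroxy OH E1" (no parentheses) and x is left unchanged.
unescaped <- str_replace_all(x, "2-hydroxy (OH) E1", "2-OHE-1")

# Escaped: "\\(" and "\\)" match literal parentheses.
escaped <- str_replace_all(x, "2-hydroxy \\(OH\\) E1", "2-OHE-1")

# Literal matching with fixed(): the pattern is not parsed as a regex.
literal <- str_replace_all(x, fixed("2-hydroxy (OH) E1"), "2-OHE-1")

print(c(unescaped = unescaped, escaped = escaped, literal = literal))
```

Only the escaped and literal variants change the value to "2-OHE-1".
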
CodePudding user response:

There are two things you need to change in your code to get the desired output. The first is the one @Emax mentioned: escaping the parentheses with double backslashes (\\( and \\)). The second is the order of the replacements: a named vector is applied one pattern at a time, so an earlier replacement can change what the later patterns see. That is why, in your original code, "androstenedione  \\(A\\)" would still not be replaced by "Androstenedione" even after escaping: "androstenedione" = "Androstenedione" runs first and capitalizes the string, so the lowercase pattern "androstenedione  \\(A\\)" no longer matches. A simple fix is to list the most specific cases (e.g., "androstenedione  \\(A\\)") before the more general ones (e.g., "androstenedione").

library(stringr)
df$new_var1 <- str_replace_all(df$Var1,
                               c(#16OHE1
                                 "16a-OH E1" = "16-OHE1", 
                                 "16a-OHE1" = "16-OHE1", 
                                 "16OHE" = "16-OHE1",
                                 #17Beta estradiol
                                 "17-b-estradiol" = "17-b-estradiol",
                                 "17b estradiol"= "17-b-estradiol",
                                 #Androstenedione
                                 "androstenedione  \\(A\\)" = "Androstenedione",
                                 "androstenedione" = "Androstenedione",
                                 "Androstenedione" = "Androstenedione",
                                 #2-OHE-1
                                 "2-OHE-1" = "2-OHE-1",
                                 "2-hydroxy \\(OH\\) E1" = "2-OHE-1")
)
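To make the ordering effect concrete, here is a minimal sketch on a single value from the data. The names `general_first` and `specific_first` are only for illustration; the patterns are the two from the answer, listed in each order.

```r
library(stringr)

x <- "androstenedione  (A)"

# General pattern first: "androstenedione" is capitalized, after which the
# lowercase, more specific pattern no longer matches the string.
general_first <- str_replace_all(
  x,
  c("androstenedione" = "Androstenedione",
    "androstenedione  \\(A\\)" = "Androstenedione")
)

# Specific pattern first: the whole string, including "  (A)", is replaced.
specific_first <- str_replace_all(
  x,
  c("androstenedione  \\(A\\)" = "Androstenedione",
    "androstenedione" = "Androstenedione")
)

print(general_first)   # "Androstenedione  (A)"
print(specific_first)  # "Androstenedione"
```
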

CodePudding user response:

Here's an approach with agrep (fuzzy matching) that does not require escaping any parentheses. If other examples need it, you can tune how many insertions, deletions and substitutions agrep tolerates through its max.distance argument.

replacements

repl <- c(`16a-OH E1` = "16-OHE1", `16a-OHE1` = "16-OHE1", `16OHE` = "16-OHE1", 
`17-b-estradiol` = "17-b-estradiol", `17b estradiol` = "17-b-estradiol", 
androstenedione = "Androstenedione", Androstenedione = "Androstenedione", 
`Androstenedione  (A)` = "Androstenedione", `2-OHE-1` = "2-OHE-1", 
`2-hydroxy (OH) E1` = "2-OHE-1")
df$new_var1 <- sapply(seq_along(df$Var1), function(x){ 
  re=repl[agrep(df$Var1[x], names(repl))][1]; 
  ifelse(is.na(re), df$Var1[x], re) })

df$new_var1
 [1] "16-pathway"                               
 [2] "16-OHE1"                                  
 [3] "16-OHE1"                                  
 [4] "16-OHE1"                                  
 [5] "17-b-estradiol"                           
 [6] "17-OH-progesterone"                       
 [7] "17-OH-progesterone/ androstenedione ratio"
 [8] "17b-HSD (rs2830A)"                        
 [9] "17b-HSD (rs592389 G)"                     
[10] "17b-HSD (rs615492 G)"                     
[11] "17b-HSD (rs615942 G)"                     
[12] "17-b-estradiol"                           
[13] "17OH-progesterone"                        
[14] "2-OHE-1"                                  
[15] "2-OHE-1"                                  
[16] "2-OHE-1"                                  
[17] "2-pathway"                                
[18] "2:16 OHE ratio"                           
[19] "2:16 pathway ratio"                       
[20] "16-OHE1"                                  
[21] "2:16OHE"                                  
[22] "16-OHE1"                                  
[23] "Adiponectin"                              
[24] "Androstenedione"                          
[25] "Androstenedione"                          
[26] "Androstenedione"
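
The tolerance of agrep can be sketched with two labels from the data. This is only an illustration, and it assumes the default max.distance of 0.1, which is a fraction of the pattern length rounded up to a whole number of edits. The names `loose`, `strict` and `same` are hypothetical.

```r
# Default tolerance: 0.1 of the 7-character pattern, rounded up, allows
# one edit, so "2-OHE-2" is treated as a match for "2-OHE-1".
loose <- agrep("2-OHE-2", "2-OHE-1")

# max.distance = 0 allows no edits, so the two labels no longer match.
strict <- agrep("2-OHE-2", "2-OHE-1", max.distance = 0)

# An identical label still matches with max.distance = 0.
same <- agrep("2-OHE-1", "2-OHE-1", max.distance = 0)

print(length(loose))   # number of matches with the default tolerance
print(length(strict))  # number of matches with no edits allowed
```

This is why row 16 of the output above shows "2-OHE-2" recoded to "2-OHE-1". If such labels should stay distinct, a stricter max.distance is worth considering.
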