Unexpected behavior by str_remove_* in stringr package

Time:01-21

I am working on a set of simple tasks that rely heavily on stringr. One task is to remove a specific pattern from a string.

Below is a toy sample with the columns temp and current_house. I want to remove the pattern given in current_house from temp and store the result in a new column, temp2. For some rows (8 to 15 in the output below), str_remove() leaves temp unchanged. I also tried str_remove_all(), without success.

What am I missing? The number of words in the search pattern should not be the problem, since multi-word patterns are removed successfully in other rows.

library(data.table)
library(stringr)


# df is the toy sample defined at the end of this post
head(df)
#>            temp current_house
#> 1: Lazard 528 2        Lazard
#> 2:   KPMG 525 1          KPMG
#> 3:   KPMG 525 1          KPMG
#> 4:   KPMG 524 4          KPMG
#> 5:   KPMG 524 4          KPMG
#> 6:   KPMG 524 4          KPMG

# adding the new column temp2 by removing the pattern current_house
df[ , temp2 := str_remove(temp, current_house)]

df
#>                                                       temp
#>  1:                                           Lazard 528 2
#>  2:                                             KPMG 525 1
#>  3:                                             KPMG 525 1
#>  4:                                             KPMG 524 4
#>  5:                                             KPMG 524 4
#>  6:                                             KPMG 524 4
#>  7:                                             KPMG 524 4
#>  8: Development and Investment Bank of Turkey (TKYB) 524 4
#>  9: Development and Investment Bank of Turkey (TKYB) 524 4
#> 10: Development and Investment Bank of Turkey (TKYB) 524 4
#> 11: Development and Investment Bank of Turkey (TKYB) 524 4
#> 12: Development and Investment Bank of Turkey (TKYB) 524 4
#> 13: Development and Investment Bank of Turkey (TKYB) 524 4
#> 14: Development and Investment Bank of Turkey (TKYB) 524 4
#> 15: Development and Investment Bank of Turkey (TKYB) 524 4
#> 16:                                         Investec 520 4
#> 17:                         Numis Securities Limited 520 2
#> 18:                         Numis Securities Limited 520 1
#> 19:                                JPMorgan Cazenove 520 1
#> 20:                                JPMorgan Cazenove 520 1
#> 21:                  Fenchurch Advisory Partners LLP 520 1
#> 22:                  Fenchurch Advisory Partners LLP 520 1
#> 23:                  Fenchurch Advisory Partners LLP 520 1
#> 24:                                              EY 518 16
#> 25:                                             KPMG 508 1
#> 26:                                             KPMG 508 1
#> 27:           Capitalmind Corporate Finance Advisory 502 2
#> 28:           Capitalmind Corporate Finance Advisory 502 1
#> 29:             Daiwa Securities Group / DC Advisory 500 3
#> 30:                           LionTree Advisors, LLC 500 1
#> 31:                           LionTree Advisors, LLC 500 1
#> 32:                      Ping'an Securities Co.,Ltd. 496 1
#> 33:                      Ping'an Securities Co.,Ltd. 496 1
#> 34:                      Ping'an Securities Co.,Ltd. 496 1
#> 35:                      Ping'an Securities Co.,Ltd. 496 1
#> 36:                      Ping'an Securities Co.,Ltd. 496 1
#> 37:                      Ping'an Securities Co.,Ltd. 496 1
#> 38:                   Guotai Junan Securities Co Ltd 496 1
#> 39:                   Guotai Junan Securities Co Ltd 496 1
#> 40:                                              EY 493 16
#>                                                       temp
#>                                        current_house
#>  1:                                           Lazard
#>  2:                                             KPMG
#>  3:                                             KPMG
#>  4:                                             KPMG
#>  5:                                             KPMG
#>  6:                                             KPMG
#>  7:                                             KPMG
#>  8: Development and Investment Bank of Turkey (TKYB)
#>  9: Development and Investment Bank of Turkey (TKYB)
#> 10: Development and Investment Bank of Turkey (TKYB)
#> 11: Development and Investment Bank of Turkey (TKYB)
#> 12: Development and Investment Bank of Turkey (TKYB)
#> 13: Development and Investment Bank of Turkey (TKYB)
#> 14: Development and Investment Bank of Turkey (TKYB)
#> 15: Development and Investment Bank of Turkey (TKYB)
#> 16:                                         Investec
#> 17:                         Numis Securities Limited
#> 18:                         Numis Securities Limited
#> 19:                                JPMorgan Cazenove
#> 20:                                JPMorgan Cazenove
#> 21:                  Fenchurch Advisory Partners LLP
#> 22:                  Fenchurch Advisory Partners LLP
#> 23:                  Fenchurch Advisory Partners LLP
#> 24:                                               EY
#> 25:                                             KPMG
#> 26:                                             KPMG
#> 27:           Capitalmind Corporate Finance Advisory
#> 28:           Capitalmind Corporate Finance Advisory
#> 29:             Daiwa Securities Group / DC Advisory
#> 30:                           LionTree Advisors, LLC
#> 31:                           LionTree Advisors, LLC
#> 32:                      Ping'an Securities Co.,Ltd.
#> 33:                      Ping'an Securities Co.,Ltd.
#> 34:                      Ping'an Securities Co.,Ltd.
#> 35:                      Ping'an Securities Co.,Ltd.
#> 36:                      Ping'an Securities Co.,Ltd.
#> 37:                      Ping'an Securities Co.,Ltd.
#> 38:                   Guotai Junan Securities Co Ltd
#> 39:                   Guotai Junan Securities Co Ltd
#> 40:                                               EY
#>                                        current_house
#>                                                      temp2
#>  1:                                                  528 2
#>  2:                                                  525 1
#>  3:                                                  525 1
#>  4:                                                  524 4
#>  5:                                                  524 4
#>  6:                                                  524 4
#>  7:                                                  524 4
#>  8: Development and Investment Bank of Turkey (TKYB) 524 4
#>  9: Development and Investment Bank of Turkey (TKYB) 524 4
#> 10: Development and Investment Bank of Turkey (TKYB) 524 4
#> 11: Development and Investment Bank of Turkey (TKYB) 524 4
#> 12: Development and Investment Bank of Turkey (TKYB) 524 4
#> 13: Development and Investment Bank of Turkey (TKYB) 524 4
#> 14: Development and Investment Bank of Turkey (TKYB) 524 4
#> 15: Development and Investment Bank of Turkey (TKYB) 524 4
#> 16:                                                  520 4
#> 17:                                                  520 2
#> 18:                                                  520 1
#> 19:                                                  520 1
#> 20:                                                  520 1
#> 21:                                                  520 1
#> 22:                                                  520 1
#> 23:                                                  520 1
#> 24:                                                 518 16
#> 25:                                                  508 1
#> 26:                                                  508 1
#> 27:                                                  502 2
#> 28:                                                  502 1
#> 29:                                                  500 3
#> 30:                                                  500 1
#> 31:                                                  500 1
#> 32:                                                  496 1
#> 33:                                                  496 1
#> 34:                                                  496 1
#> 35:                                                  496 1
#> 36:                                                  496 1
#> 37:                                                  496 1
#> 38:                                                  496 1
#> 39:                                                  496 1
#> 40:                                                 493 16
#>                                                      temp2

Created on 2022-01-20 by the reprex package (v2.0.1)

Please find the toy sample below.

df = structure(list(temp = c("Lazard 528 2", "KPMG 525 1", "KPMG 525 1", 
                             "KPMG 524 4", "KPMG 524 4", "KPMG 524 4", "KPMG 524 4", "Development and Investment Bank of Turkey (TKYB) 524 4", 
                             "Development and Investment Bank of Turkey (TKYB) 524 4", "Development and Investment Bank of Turkey (TKYB) 524 4", 
                             "Development and Investment Bank of Turkey (TKYB) 524 4", "Development and Investment Bank of Turkey (TKYB) 524 4", 
                             "Development and Investment Bank of Turkey (TKYB) 524 4", "Development and Investment Bank of Turkey (TKYB) 524 4", 
                             "Development and Investment Bank of Turkey (TKYB) 524 4", "Investec 520 4", 
                             "Numis Securities Limited 520 2", "Numis Securities Limited 520 1", 
                             "JPMorgan Cazenove 520 1", "JPMorgan Cazenove 520 1", "Fenchurch Advisory Partners LLP 520 1", 
                             "Fenchurch Advisory Partners LLP 520 1", "Fenchurch Advisory Partners LLP 520 1", 
                             "EY 518 16", "KPMG 508 1", "KPMG 508 1", "Capitalmind Corporate Finance Advisory 502 2", 
                             "Capitalmind Corporate Finance Advisory 502 1", "Daiwa Securities Group / DC Advisory 500 3", 
                             "LionTree Advisors, LLC 500 1", "LionTree Advisors, LLC 500 1", 
                             "Ping'an Securities Co.,Ltd. 496 1", "Ping'an Securities Co.,Ltd. 496 1", 
                             "Ping'an Securities Co.,Ltd. 496 1", "Ping'an Securities Co.,Ltd. 496 1", 
                             "Ping'an Securities Co.,Ltd. 496 1", "Ping'an Securities Co.,Ltd. 496 1", 
                             "Guotai Junan Securities Co Ltd 496 1", "Guotai Junan Securities Co Ltd 496 1", 
                             "EY 493 16"), 
                    current_house = c("Lazard", "KPMG", "KPMG", "KPMG", 
                                      "KPMG", "KPMG", "KPMG", "Development and Investment Bank of Turkey (TKYB)", 
                                      "Development and Investment Bank of Turkey (TKYB)", "Development and Investment Bank of Turkey (TKYB)", 
                                      "Development and Investment Bank of Turkey (TKYB)", "Development and Investment Bank of Turkey (TKYB)", 
                                      "Development and Investment Bank of Turkey (TKYB)", "Development and Investment Bank of Turkey (TKYB)", 
                                      "Development and Investment Bank of Turkey (TKYB)", "Investec", 
                                      "Numis Securities Limited", "Numis Securities Limited", "JPMorgan Cazenove", 
                                      "JPMorgan Cazenove", "Fenchurch Advisory Partners LLP", "Fenchurch Advisory Partners LLP", 
                                      "Fenchurch Advisory Partners LLP", "EY", "KPMG", "KPMG", "Capitalmind Corporate Finance Advisory", 
                                      "Capitalmind Corporate Finance Advisory", "Daiwa Securities Group / DC Advisory", 
                                      "LionTree Advisors, LLC", "LionTree Advisors, LLC", "Ping'an Securities Co.,Ltd.", 
                                      "Ping'an Securities Co.,Ltd.", "Ping'an Securities Co.,Ltd.", 
                                      "Ping'an Securities Co.,Ltd.", "Ping'an Securities Co.,Ltd.", 
                                      "Ping'an Securities Co.,Ltd.", "Guotai Junan Securities Co Ltd", 
                                      "Guotai Junan Securities Co Ltd", "EY")), row.names = c(NA, -40L
                                      ), 
               class = c("data.table", "data.frame"))

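Answer:

The parentheses in your current_house values are regex metacharacters. str_remove() treats its pattern as a regular expression by default, so "(TKYB)" is read as a capture group that matches the bare text TKYB, and the pattern never matches the literal "(TKYB)" in temp. Wrap the pattern in stringr::fixed() to match it literally: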

setDT(df)
df[, temp2 := str_remove(temp, current_house)           # initial, not working
  ][, temp3 := str_remove(temp, fixed(current_house))   # working
  ][]
#                                        temp                           current_house                                   temp2   temp3
#                                      <char>                                  <char>                                  <char>  <char>
#  1:                            Lazard 528 2                                  Lazard                                   528 2   528 2
#  2:                              KPMG 525 1                                    KPMG                                   525 1   525 1
#  3:                              KPMG 525 1                                    KPMG                                   525 1   525 1
#  4:                              KPMG 524 4                                    KPMG                                   524 4   524 4
#  5:                              KPMG 524 4                                    KPMG                                   524 4   524 4
#  6:                              KPMG 524 4                                    KPMG                                   524 4   524 4
#  7:                              KPMG 524 4                                    KPMG                                   524 4   524 4
#  8: Development and Investment Bank of T... Development and Investment Bank of T... Development and Investment Bank of T...   524 4
#  9: Development and Investment Bank of T... Development and Investment Bank of T... Development and Investment Bank of T...   524 4
# 10: Development and Investment Bank of T... Development and Investment Bank of T... Development and Investment Bank of T...   524 4
# ---                                                                                                                                
# 31:            LionTree Advisors, LLC 500 1                  LionTree Advisors, LLC                                   500 1   500 1
# 32:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 33:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 34:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 35:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 36:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 37:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 38:    Guotai Junan Securities Co Ltd 496 1          Guotai Junan Securities Co Ltd                                   496 1   496 1
# 39:    Guotai Junan Securities Co Ltd 496 1          Guotai Junan Securities Co Ltd                                   496 1   496 1
# 40:                               EY 493 16                                      EY                                  493 16  493 16
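
To see the difference in isolation, here is a minimal sketch on two shortened strings (x and pattern are made-up stand-ins for temp and current_house, not the original data). It only uses str_detect(), str_remove() and fixed() from stringr:

```r
library(stringr)

# Hypothetical, shortened stand-ins for temp and current_house
x       <- c("KPMG 525 1", "Bank of Turkey (TKYB) 524 4")
pattern <- c("KPMG", "Bank of Turkey (TKYB)")

# Default: the pattern is a regex, so "(TKYB)" is a group matching "TKYB"
# without the surrounding parentheses. The second string is not matched.
regex_hit <- str_detect(x, pattern)

# fixed(): every character of the pattern is compared literally.
fixed_hit <- str_detect(x, fixed(pattern))

# The regex version does match a string that lacks the parentheses.
no_parens_hit <- str_detect("Bank of Turkey TKYB", pattern[2])

regex_removed <- str_remove(x, pattern)         # second element unchanged
fixed_removed <- str_remove(x, fixed(pattern))  # name removed in both
```

Other metacharacters deserve the same care: the "." in names such as "Co.,Ltd." matches any character in regex mode, so those rows only work because "." also happens to match itself. fixed() avoids relying on that.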

You might also want to wrap the str_remove() call in trimws(), since temp3 still has a leading space left over from the removed name:

head(df$temp3)
# [1] " 528 2" " 525 1" " 525 1" " 524 4" " 524 4" " 524 4"

df[, temp3 := trimws(str_remove(temp, fixed(current_house)))]
head(df$temp3)
# [1] "528 2" "525 1" "525 1" "524 4" "524 4" "524 4"
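
If you would rather stay within stringr, str_trim() should do the same job as trimws() here. A small sketch, again on made-up stand-in vectors rather than the original table:

```r
library(stringr)

# Hypothetical stand-ins for temp and current_house
x       <- c("KPMG 525 1", "Bank of Turkey (TKYB) 524 4")
pattern <- c("KPMG", "Bank of Turkey (TKYB)")

# Remove the literal name, then strip the whitespace left at either end
cleaned <- str_trim(str_remove(x, fixed(pattern)))
```

Inside the data.table call this would read df[, temp3 := str_trim(str_remove(temp, fixed(current_house)))].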