Using gsub to replace matches with same number of characters

Time:01-29

Is it possible to use gsub to replace each character of a match with another character? I have read and tried solutions from many questions without success, because they were very specific to the example being used. Some that looked promising but ultimately did not get me there are:

gsub-replace-regex-match-with-regex-replacement-string

replace-pattern-with-one-space-per-character-in-perl

What I am looking for is a general way to do the following. I have a list of regexes, which I combine into a single regular expression of the form:

pattern <- "[0-9]{3,}|[a-z]{3,}|..."

Given a string such as

x <- "1234 abc 12 a 123456"

I would like to get back from gsub the string with each character of a match replaced by #:

"#### ### 12 a ######"

instead of

"# # 12 a #"
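To make the unwanted result concrete, here is a minimal reproduction using only base R. It shows that a fixed replacement string is substituted once per match, whatever the length of the match:

```r
x <- "1234 abc 12 a 123456"
pattern <- "[0-9]{3,}|[a-z]{3,}"

# A fixed replacement is inserted once per match, so each run collapses to one "#"
collapsed <- gsub(pattern, "#", x, perl = TRUE)
print(collapsed)
# [1] "# # 12 a #"
```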

I have used gsub with the perl arg set to TRUE, and experimented with an online regex tool, using things like \G and lookarounds, but I cannot figure it out.

The reason I am looking for a way to do this with gsub (I realise it is easy to do in other ways) is to use it as a method of censoring certain words and matches such as dates, phone numbers and email addresses in a dplyr pipeline. The function I have works fine, except that any replacement is fixed, and I would like to replace each matching character, rather than each matching substring.

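library(dplyr)  # provides %>%, mutate() and across()

# Replace every match of any word in .words with the fixed string .replacement
# in the columns selected through `...`.
filter_words <- function(.data, .words, .replacement, ...) {
  .data %>% dplyr::mutate(
    dplyr::across(
      c(...),
      ~ gsub(
          # "\\b" is already prepended to every word, so collapse with "|" only
          paste0("\\b", .words, collapse = "|"),
          .replacement, .,
          ignore.case = TRUE, perl = TRUE
      )
    )
  )
}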

I did try using a package called mgsub for the mgsub_censor function it provides. This does work, but it is several orders of magnitude slower than what I already have, so not really practical for large datasets.

I did try creating a custom gsub function able to accept a function (that could return a string consisting of the same number of characters as each match) as the replacement argument. It worked fine for a single string, but failed to work in a pipe.
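One way such a function-style replacement could be written in base R is with gregexpr() and the replacement form of regmatches(). This is only a sketch, not the custom function described above: the name mask_matches and its char argument are hypothetical. Because it takes and returns a character vector of the same length, it should also be usable on a column inside mutate().

```r
# Hypothetical helper: replace every character of each match with `char`.
mask_matches <- function(x, pattern, char = "#", ignore.case = FALSE) {
  # Locate all matches in every element of x
  m <- gregexpr(pattern, x, ignore.case = ignore.case, perl = TRUE)
  # Swap each matched substring for `char` repeated to the same length
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(hits) strrep(char, nchar(hits))
  )
  x
}

x <- "1234 abc 12 a 123456"
pattern <- "[0-9]{3,}|[a-z]{3,}"
print(mask_matches(x, pattern))
# [1] "#### ### 12 a ######"
```

Elements without any match are returned unchanged, since their list of matches is empty.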

CodePudding user response:

You can pass a function as the replacement argument of stringr::str_replace_all and use strrep to repeat the # symbol once for each character of the match.

x <- "1234 abc 12 a 123456"
pattern <- "[0-9]{3,}|[a-z]{3,}"

stringr::str_replace_all(x, pattern, function(m) strrep('#', nchar(m)))
#[1] "#### ### 12 a ######"
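The same idea could be folded back into the filter_words function from the question. The following is a sketch under some assumptions: the name filter_words_masked, the .char argument and the example data frame are made up for illustration, and case-insensitive matching is requested through stringr::regex(ignore_case = TRUE) in place of gsub's ignore.case argument.

```r
library(dplyr)

# Hypothetical variant of filter_words: masks each matched character with .char
filter_words_masked <- function(.data, .words, .char, ...) {
  # Same pattern shape as in the question: a leading word boundary per word
  pat <- stringr::regex(paste0("\\b", .words, collapse = "|"), ignore_case = TRUE)
  .data %>% dplyr::mutate(
    dplyr::across(
      c(...),
      ~ stringr::str_replace_all(.x, pat, function(m) strrep(.char, nchar(m)))
    )
  )
}

# Made-up example data
df <- data.frame(
  id = 1:2,
  note = c("Call Bob or alice", "nothing here"),
  stringsAsFactors = FALSE
)
out <- df %>% filter_words_masked(c("bob", "alice"), "#", note)
print(out$note)
# [1] "Call ### or #####" "nothing here"
```

As in the original function, only the start of each word is anchored with \b, so a listed word that is a prefix of a longer word is masked as well.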