Replace a whole word containing a pattern - gsub and R

Time:01-08

I am trying to clean some garbage out of some text. While doing this, I am assuming that any word that has a letter (any letter) repeated three or more times is garbage - and I want to remove it.

I've come up with this:

gsub(pattern = "[a-zA-Z]\\1\\1", replacement = "", string)

in which string is the character vector, but this doesn't work. Everything else I've tried might find the pattern, but it just removes the pattern, leaving a mess. I'm trying to remove the whole word with the pattern in it.

Any ideas?

CodePudding user response:

You need one of the following:

gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
stringr::str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")

See an R demo:

string <- "This is a baaaad unnnnecessary short word"
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
library(stringr)
str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")

All three yield [1] "This is a short word".

See the regex demo. Regex details:

  • \s* - zero or more whitespaces
  • \p{L}* / [[:alpha:]]* - zero or more letters
  • (\p{L}) - Capturing group 1: any single letter
  • \1{2} - two occurrences of the same value as in Group 1
  • \p{L}* / [[:alpha:]]* - zero or more letters.
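
One thing the breakdown above implies: \s* only consumes whitespace before a matched word, so a garbage word at the very start of a string leaves its trailing space behind. A minimal sketch of one way to handle that, assuming base R only (the helper name remove_garbage_words and the trimws() cleanup are illustrative assumptions, not part of the answer):

```r
# Hypothetical helper: the name and the trimws() step are assumptions.
remove_garbage_words <- function(x) {
  # Optional leading whitespace, then a word in which some letter
  # occurs three times in a row (capture + \1{2}).
  pattern <- "\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*"
  out <- gsub(pattern, "", x)
  # Strip the leading space left when the first word was removed.
  trimws(out)
}

res <- remove_garbage_words(c("baaaad start here", "ends in zzzz", "all fine"))
# res is c("start here", "ends in", "all fine")
```

The backreference is case-sensitive here, so a word like "Aaa" is kept; passing ignore.case = TRUE to gsub() may be worth trying if that matters for your data.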

CodePudding user response:

You need to turn the [a-zA-Z] character class into a "capture group" by wrapping it in parentheses, since \\1 needs something to reference:

gsub("([a-zA-Z])\\1\\1", "", "aabbbccdddee")
# [1] "aaccee"
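
As a small illustration of how this relates to the original question (the sample sentence is made up for the example): the capture group alone removes only the three repeated letters, while surrounding it with \\w* extends the match to the whole word.

```r
x <- "a baaaad word"  # made-up sample input

# Capture group only: strips three of the four "a"s, leaving "bad".
run_only <- gsub("([a-zA-Z])\\1\\1", "", x)
# run_only is "a bad word"

# Adding \\w* on both sides makes the match cover the whole word.
whole_word <- gsub("\\w*([a-zA-Z])\\1\\1\\w*", "", x)
# whole_word is "a  word" (two spaces: the surrounding whitespace is untouched)
```

The leftover double space is why the other answer also consumes leading whitespace with \\s*.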

CodePudding user response:

r2evans's example with a different regex:

gsub("(\\w)\\1{2,}", "", "aabbbccdddee")
# [1] "aaccee"
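
The two variants differ once a run is longer than three characters. A quick sketch with a made-up input, assuming the quantifier is written without a space as {2,}:

```r
x <- "aabbbbccdddddee"  # made-up input: runs of 4 "b"s and 5 "d"s

# \\1\\1 matches exactly three characters at a time, so leftovers remain.
exactly_three <- gsub("(\\w)\\1\\1", "", x)
# exactly_three is "aabccddee"

# \\1{2,} matches the whole run of three or more.
three_or_more <- gsub("(\\w)\\1{2,}", "", x)
# three_or_more is "aaccee"
```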