I am trying to clean some garbage out of some text. While doing this, I am assuming that any word containing the same letter (any letter) three or more times in a row is garbage, and I want to remove it.
I've come up with this:
gsub(pattern = "[a-zA-Z]\\1\\1", replacement = "", string)
in which string is the character vector, but this doesn't work. Everything else I've tried might find the pattern, but it just removes the pattern, leaving a mess. I'm trying to remove the whole word with the pattern in it.
Any ideas?
CodePudding user response:
You can use any of the following:
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
stringr::str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")
See an R demo:
string <- "This is a baaaad unnnnecessary short word"
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
library(stringr)
str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")
All three yield [1] "This is a short word".
See the regex demo. Regex details:
\s* - zero or more whitespace characters
\p{L}* / [[:alpha:]]* - zero or more letters
(\p{L}) / ([[:alpha:]]) - Capturing group 1: any single letter
\1{2} - two occurrences of the same value as in Group 1
\p{L}* / [[:alpha:]]* - zero or more letters.
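To try the pattern on more than one string, it can be wrapped in a small function. This is only a sketch: the name remove_repeated_letter_words and the second test string are made up for illustration, and the trimws() call is an addition that tidies up the space left behind when the removed word is the first one in the string.

```r
# Hypothetical helper (not part of the answer above): drop every word that
# contains the same letter three or more times in a row.
remove_repeated_letter_words <- function(x) {
  # Optional whitespace, the word's leading letters, one captured letter,
  # two more copies of that letter, then the rest of the word.
  pattern <- "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*"
  cleaned <- gsub(pattern, "", x, perl = TRUE)
  # The pattern only consumes whitespace *before* a word, so a removed
  # first word leaves a leading space; trim it.
  trimws(cleaned)
}

string <- c("This is a baaaad unnnnecessary short word",
            "zzzap goes the book keeper")
result <- remove_repeated_letter_words(string)
print(result)
```

The match is case-sensitive, so a capital followed by two lowercase copies of the same letter is not counted as three repeats.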
CodePudding user response:
You need to turn the [a-zA-Z] class into a "capture group" by wrapping it in parentheses, since the \\1 needs something to reference:
gsub("([a-zA-Z])\\1\\1", "", "aabbbccdddee")
# [1] "aaccee"
CodePudding user response:
The same example as r2evans's, with a different regex (note that \\w also matches digits and underscores, not only letters):
gsub("(\\w)\\1{2,}", "", "aabbbccdddee")
# [1] "aaccee"
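The two capture-group examples above delete only the repeated letters themselves, whereas the question asks for the whole word to go. One way to get there with the same pattern, sketched here as an alternative and not taken from any of the answers, is to split the text into words, test each word with grepl(), and paste the survivors back together. The function name drop_garbage_words is hypothetical, and the sketch assumes words are separated by whitespace.

```r
# Hypothetical helper: split on whitespace, drop flagged words, re-join.
drop_garbage_words <- function(x) {
  vapply(strsplit(x, "\\s+"), function(words) {
    # Flag any word holding a letter followed by two copies of itself.
    is_garbage <- grepl("([a-zA-Z])\\1\\1", words)
    paste(words[!is_garbage], collapse = " ")
  }, character(1))
}

cleaned <- drop_garbage_words("This is a baaaad unnnnecessary short word")
print(cleaned)
```

Because the split is on whitespace only, punctuation attached to a flagged word is dropped along with it, and runs of several spaces in the input collapse to single spaces in the output.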
