I am trying to apply the following rule:
Chop the string at ; so that each piece has a maximum length of n.
For example,
n <- 4
string <- c("a;a;aabbbb;ccddee;ff")
output <- c("a;a;", "aabb", "bb;", "ccdd", "ee;", "ff")
For "aabb": the chunk up to the next ;, "aabbbb", is longer than n = 4, so we chop by length, 4.
For "bb;": its length is less than 4, so we next consider "bb;ccddee". However, that exceeds 4, and the string already contains a ;, so we chop at the ;.
Currently, I can split after every n characters or after every ; by using a regex:
num <- 4
splitvar <- ";"
## splits pattern
pattern <- paste0("(?<=.{", num, "}|", splitvar, ")")
> pattern
[1] "(?<=.{4}|;)"
string <- c("a;a;aabbbb;ccddee;ff")
strsplit(string, pattern, perl = TRUE)
[[1]]
[1] "a;" "a;" "aabb" "bb;" "ccdd" "ee;" "ff"
As you can see, "a;" and "a;" do not actually need to be split, since their combined length does not exceed n (2 + 2 = 4).
Does anyone have a solution for this? Thank you.
CodePudding user response:
Your regex matches any location that is preceded by either num chars or a splitvar.
The pattern you need matches either one to num - 1 chars followed by a splitvar (or the end of the string), or exactly num chars other than the splitvar char.
So, you can use
num <- 4
splitvar <- ";"
pattern <- paste0(".{1,", num-1, "}(?:",splitvar,"|$)|[^",splitvar,"]{",num,"}")
pattern ## => .{1,3}(?:;|$)|[^;]{4}
string <- c("a;a;aabbbb;ccddee;ff")
unlist(regmatches(string, gregexpr(pattern, string)))
## => "a;a;" "aabb" "bb;" "ccdd" "ee;" "ff"
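For reuse, the same pattern can be wrapped in a small helper. This is only a sketch: the name chop_string is made up here, and it assumes sep is a single character with no special meaning in a regex (such as ;).

```r
# Hypothetical helper (not from the original answer): builds the pattern
# shown above and extracts the pieces with base R.
# Assumes `sep` is a single character that needs no regex escaping.
chop_string <- function(x, n, sep = ";") {
  stopifnot(length(x) == 1, n >= 2, nchar(sep) == 1)
  pattern <- paste0(".{1,", n - 1, "}(?:", sep, "|$)|[^", sep, "]{", n, "}")
  unlist(regmatches(x, gregexpr(pattern, x)))
}

string <- "a;a;aabbbb;ccddee;ff"
pieces <- chop_string(string, 4)

# Sanity checks: no piece is longer than n, and the pieces rebuild the input.
all(nchar(pieces) <= 4)
identical(paste(pieces, collapse = ""), string)
```

The second check is worth keeping. A separator that directly follows a block of exactly n non-separator characters (as in "abcd;efgh") does not appear to be matched by either alternative, so it would be silently dropped from the result.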
With stringr:
library(stringr)
unlist(str_extract_all(string, pattern))
Details:
.{1,3}(?:;|$) - one, two, or three chars (other than line break chars if you use stringr), as many as possible, and then a ; char or the end of the string
| - or
[^;]{4} - any four chars other than a ; char.
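As a cross-check on the regex, the rule from the question can also be written as a plain loop. The sketch below is one reading of that rule, not the method used in the answer: take the next n characters, cut after the last ; in that window if there is one, otherwise cut after the whole window. The name chop_greedy is made up for this example.

```r
# Hypothetical regex-free version of the chopping rule.
# Assumes `x` is a single string and `sep` is a single character.
chop_greedy <- function(x, n, sep = ";") {
  stopifnot(length(x) == 1, n >= 1, nchar(sep) == 1)
  pieces <- character(0)
  while (nchar(x) > 0) {
    # Look at the next (at most) n characters.
    window <- substr(x, 1, n)
    # Positions of the separator inside the window (-1 if there is none).
    hits <- gregexpr(sep, window, fixed = TRUE)[[1]]
    # Cut after the last separator, or after the whole window.
    stop_at <- if (hits[1] == -1) nchar(window) else max(hits)
    pieces <- c(pieces, substr(x, 1, stop_at))
    x <- substr(x, stop_at + 1, nchar(x))
  }
  pieces
}

chop_greedy("a;a;aabbbb;ccddee;ff", 4)
```

Because every iteration consumes at least one character and never skips any, pasting the pieces back together always reproduces the input. The loop can differ from the regex on some inputs, for example when a short tail still contains a separator, so treat it as a comparison tool and not as a drop-in replacement.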
