I am looking for a way to split a string, not on an underscore or on one specific word, but on any of a set of known words, and without losing the word that was matched. For example,
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")
x <- paste(a, b, sep = "_")
I then would like a way to separate x into a and b.
CodePudding user response:
This is a bit tricky because the 4th and 5th values of a already contain the separator you paste with. strsplit() is the general tool for splitting strings on a separator, but a plain split on "_" breaks those values apart as well. To split correctly you need to know what b looks like (or paste with a separator that does not occur in the data):
strsplit(x, split = "_")
[[1]]
[1] "Hello" "sum"
[[2]]
[1] "Joe" "sum" "one"
[[3]]
[1] "Simpsons" "sum"
[[4]]
[1] "Oh" "No" "sum" "one"
[[5]]
[1] "Hiya" "Hi" "sum"
[[6]]
[1] "oh" "sum" "one"
The result is a list with one element per string; each element is a character vector, and the vectors have different lengths.
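The unique-separator route mentioned above can be sketched as follows. This is only an illustration: it assumes you control how x is built, and the separator "::" and the names x2, parts, a2 and b2 are made up for this example. The only requirement is that the separator never occurs in a or b.

```r
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")

# Assumption: "::" never occurs in a or b
x2 <- paste(a, b, sep = "::")

# fixed = TRUE treats the separator literally rather than as a regex
parts <- strsplit(x2, split = "::", fixed = TRUE)

# every element now has exactly two pieces: the a part and the b part
a2 <- vapply(parts, `[`, character(1), 1)
b2 <- vapply(parts, `[`, character(1), 2)
```

Here a2 reproduces a, and b2 holds the (recycled) value of b that was pasted onto each element.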
One option is to build the split pattern from the values of b:
rd <- strsplit(x, split = paste0(paste0("_",b), collapse = "|"))
rd
[[1]]
[1] "Hello"
[[2]]
[1] "Joe"
[[3]]
[1] "Simpsons"
[[4]]
[1] "Oh_No"
[[5]]
[1] "Hiya_Hi"
[[6]]
[1] "oh"
# convert this to a vector:
a <- unlist(rd)
a
[1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
Now use this information the other way around:
b <- unique(gsub(paste0(paste0(a, "_"), collapse = "|"),"", x))
b
[1] "sum" "sum_one"
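Since the b part always sits at the end of the string, a slightly more defensive variant anchors the pattern there. This is a sketch rather than part of the approach above: the names b_sorted, pattern, a_part and b_part are made up for the example, and it assumes the values of b contain no regex metacharacters. Listing the longest suffix first and anchoring with $ means the result should not depend on which alternative the regex engine prefers.

```r
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")
x <- paste(a, b, sep = "_")

# try the longest suffix first, so "sum_one" is listed before "sum"
b_sorted <- b[order(nchar(b), decreasing = TRUE)]

# "_(sum_one|sum)$": one of the known suffixes, anchored at the end
pattern <- paste0("_(", paste(b_sorted, collapse = "|"), ")$")

# remove the suffix to get the a part
a_part <- sub(pattern, "", x)

# extract the matched suffix and drop its leading underscore
# (assumes every element of x matches the pattern)
b_part <- sub("^_", "", regmatches(x, regexpr(pattern, x)))
```

a_part gives the prefix of each element and b_part its suffix, so unique(b_part) recovers b. If a value of a could itself end in one of the suffixes, the split remains ambiguous and no pattern can resolve it.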
CodePudding user response:
As @Gregor Thomas already said in the comments, the information is lost once the strings are pasted. However, depending on the context, you can store the original pieces in an attribute, using a home-grown my_paste function, a print method for it, and a my_unpaste function.
Here is a sketch of the idea:
my_paste <- \(..., sep=" ", collapse=NULL, recycle0=FALSE) {  ## new paste fun
  ## paste as usual, then keep the original arguments in the 'unpaste' attribute
  o <- `attr<-`(paste(..., sep=sep, collapse=collapse, recycle0=recycle0), 'unpaste', list(...))
  return(structure(o, class=c('character', 'my_paste')))
}
print.my_paste <- function(x, ...) {  ## print method for class `my_paste'
  print(as.character(x), ...)  ## print as a plain character vector
  invisible(x)
}
my_unpaste <- \(x, warn=TRUE) {  ## the un-paste function
  if (!inherits(x, 'my_paste')) {
    if (warn) warning('Nothing to unpaste.')
    return(x)
  } else {
    return(attr(x, 'unpaste'))
  }
}
Usage
x <- my_paste(a, b, sep='_')
Looks like this,
str(x)
# 'my_paste' chr [1:6] "Hello_sum" "Joe_sum_one" "Simpsons_sum" "Oh_No_sum_one" "Hiya_Hi_sum" "oh_sum_one"
# - attr(*, "unpaste")=List of 2
# ..$ : chr [1:6] "Hello" "Joe" "Simpsons" "Oh_No" ...
# ..$ : chr [1:2] "sum" "sum_one"
but prints normally:
x ## or more verbose `print(x)`
# [1] "Hello_sum" "Joe_sum_one" "Simpsons_sum" "Oh_No_sum_one" "Hiya_Hi_sum" "oh_sum_one"
Now un-paste!
my_unpaste(x)
# [[1]]
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
#
# [[2]]
# [1] "sum" "sum_one"
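my_unpaste() returns the arguments exactly as they were passed, so b still has length 2 while x has length 6. If you want one row per element of x, the pieces can be recycled to a common length. A rough sketch, where parts stands in for the result of my_unpaste(x) and the names n, recycled and pieces are made up for the example:

```r
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")

# stand-in for my_unpaste(x): the list of arguments given to my_paste()
parts <- list(a, b)

# recycle every piece to the length of the longest one
n <- max(lengths(parts))
recycled <- lapply(parts, rep_len, length.out = n)

# one row per pasted string, one column per original argument
pieces <- data.frame(a = recycled[[1]], b = recycled[[2]],
                     stringsAsFactors = FALSE)
```

Pasting the two columns of pieces back together with sep = "_" gives the original strings again.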
It warns when there is nothing to un-paste:
my_unpaste(a)
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
# Warning message:
# In my_unpaste(a) : Nothing to unpaste.
my_unpaste(a, warn=FALSE)
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
Note: R >= 4.1 is required (for the \(x) function shorthand).
Data:
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")
