I have this string:
seed_pattern <- "K?ED??HRDDKDKD?HE?REKE??DE?KKK"
and another string:
bb_seq <- "rhhhhitv"
What I'd like to do is replace each ? with a character from bb_seq, keeping the order of bb_seq.
The number of ? characters is guaranteed to equal the length of bb_seq.
The expected result is:
KrEDhhHRDDKDKDhHEhREKEitDEvKKK
How can I achieve that with R?
I tried this but failed:
seed_pattern <- "K?ED??HRDDKDKD?HE?REKE??DE?KKK"
bb_seq <- "rhhhhitv"
sp <- seed_pattern
gr <- gregexpr("\\?+", sp)
csml <- lapply(gr, function(sp) cumsum(attr(sp, "match.length")))
regmatches(sp, gr) <- lapply(csml, function(sp) substring(bb_seq, c(1, sp[1]), sp))
sp
# KrEDrhhHRDDKDKDrhhhHErhhhhREKErhhhhitDErhhhhitvKKK
I'm open to non-regex solutions.
CodePudding user response:
Split, replace, combine:
> target <- strsplit(seed_pattern, "")[[1]]
> replacement <- strsplit(bb_seq, "")[[1]]
> target[target=="?"] <- replacement
> paste(target, collapse = "")
[1] "KrEDhhHRDDKDKDhHEhREKEitDEvKKK"
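The same three steps can be wrapped in a small function that also checks the guarantee from the question (as many ? as there are characters in bb_seq). This is only a sketch: the name fill_placeholders and its arguments are made up for illustration and are not part of base R.

```r
# Hypothetical helper (name and arguments are assumptions, not base R)
fill_placeholders <- function(pattern, fill, placeholder = "?") {
  target <- strsplit(pattern, "")[[1]]      # split pattern into single characters
  replacement <- strsplit(fill, "")[[1]]    # split fill string the same way
  is_slot <- target == placeholder          # positions to be filled
  # Fail early rather than rely on R's recycling rules
  stopifnot(sum(is_slot) == length(replacement))
  target[is_slot] <- replacement            # fill slots in order
  paste(target, collapse = "")
}

fill_placeholders("K?ED??HRDDKDKD?HE?REKE??DE?KKK", "rhhhhitv")
# [1] "KrEDhhHRDDKDKDhHEhREKEitDEvKKK"
```

With the check in place, a mismatch such as fill_placeholders("a?b", "xy") stops with an error instead of returning a partly filled string.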
CodePudding user response:
You can do this (perhaps not very efficiently) by replacing one ? at a time:
seed_pattern <- "K?ED??HRDDKDKD?HE?REKE??DE?KKK"
bb_seq <- "rhhhhitv"
# Each pass replaces the first remaining "?" with the next character of bb_seq
for (ch in unlist(strsplit(bb_seq, ""))) {
  seed_pattern <- sub("?", ch, seed_pattern, fixed = TRUE)
}
print(seed_pattern)
# [1] "KrEDhhHRDDKDKDhHEhREKEitDEvKKK"
Sadly, sub() is not vectorized over its replacement argument, so each call can fill only one ?.
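If you prefer to avoid the explicit loop, the same one-at-a-time replacement can be written as a left fold with base R's Reduce(). This is just a sketch of the same idea, not a faster method; the variable names chars and result are chosen here for illustration.

```r
seed_pattern <- "K?ED??HRDDKDKD?HE?REKE??DE?KKK"
bb_seq <- "rhhhhitv"

chars <- strsplit(bb_seq, "")[[1]]
# Start from seed_pattern; each step replaces the first remaining "?"
# with the next character of bb_seq
result <- Reduce(function(s, ch) sub("?", ch, s, fixed = TRUE), chars, seed_pattern)
result
# [1] "KrEDhhHRDDKDKDhHEhREKEitDEvKKK"
```

Unlike the loop, this leaves seed_pattern itself unchanged and returns the filled string as a new value.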
CodePudding user response:
Here is a long way. I still can't do these things without thinking in tibbles or data frames. Hoping that someday I will grasp this:
library(dplyr)
library(tidyr)
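tibble(seed_pattern, bb_seq) %>%
  # one row per piece of seed_pattern between the "?" characters
  separate_rows(seed_pattern, sep = '\\?') %>%
  # append the matching character of bb_seq to each piece, then glue everything together
  mutate(seed_pattern = paste(paste0(seed_pattern, substr(bb_seq, row_number(), row_number())), collapse = "")) %>%
  slice(1) %>%
  pull(seed_pattern)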
[1] "KrEDhhHRDDKDKDhHEhREKEitDEvKKK"
CodePudding user response:
You can do this in a one-liner with a slight change to the solution you received to your earlier question (thanks @thelatemail):
regmatches(seed_pattern, gregexpr("\\?", seed_pattern)) <- strsplit(bb_seq, "")
Check that it gives the expected result:
seed_pattern == "KrEDhhHRDDKDKDhHEhREKEitDEvKKK"
[1] TRUE
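The same assignment should also work for several patterns at once, because gregexpr() and strsplit() both return one list element per input string. A small sketch with made-up inputs (patterns and fills are illustrative names and values, not from the question):

```r
# Illustrative inputs: one fill string per pattern
patterns <- c("A?B?", "??C")
fills <- c("xy", "zw")

# Each pattern's "?" matches are replaced by the characters of its own fill string
regmatches(patterns, gregexpr("\\?", patterns)) <- strsplit(fills, "")
patterns
# [1] "AxBy" "zwC"
```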
