I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:
ENST00000000233.10|ENSG00000004059.11|OTTHUMG000
I want to get the text between the first and second pipes, which here is ENSG00000004059.11. I've tried several different regular expressions, but I can't figure out the correct syntax. What is the correct regex?
CodePudding user response:
Here is a regex.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"
Created on 2022-05-03 by the reprex package (v2.0.1)
Explanation:
^ anchors the match at the beginning of the string.
[^\\|]* matches any character except the pipe, zero or more times.
\\| matches a literal pipe; it must be escaped because | is a regex metacharacter.
^[^\\|]*\\| combines the three above: everything from the start of the string up to and including the first pipe.
([^\\|]+) is a capture group that matches one or more non-pipe characters.
\\|.*$ matches the second pipe plus everything after it, up to the end of the string.
Then keep the 1st (and only) group with "\\1".
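One thing to keep in mind with sub() is that it returns the input unchanged when the pattern does not match. The sketch below, which assumes a made-up second entry with no pipes, contrasts that with regexec()/regmatches(), where a non-match can be turned into NA. The names ids, pattern and gene are only illustrative.

```r
# Second entry is a made-up example with no pipes, to show the no-match case
ids <- c("ENST00000000233.10|ENSG00000004059.11|OTTHUMG000", "no_pipes_here")

# Same shape of pattern: start, non-pipes, pipe, captured non-pipes, pipe, rest
pattern <- "^[^|]*\\|([^|]+)\\|.*$"

# sub() leaves non-matching strings as they are
via_sub <- sub(pattern, "\\1", ids)

# regexec() returns the full match plus each capture group;
# a non-match yields character(0), which is mapped to NA here
m <- regmatches(ids, regexec(pattern, ids))
gene <- vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
               character(1))

print(via_sub)
print(gene)
```

Whether an unchanged string or NA is preferable depends on how the result is used downstream; NA is usually easier to spot.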
CodePudding user response:
Try this: \|.*\| or, in R, \\|.*\\|, since the backslash itself has to be escaped inside an R string. The pattern is an escaped pipe, followed by any character (.) repeated any number of times (*), followed by another escaped pipe.
The match includes both pipes, so if you don't want them, wrap the extraction in str_sub, for example str_sub(str_extract(MyString, "\\|.*\\|"), 2, -2).
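A caveat worth checking: .* is greedy, so if the real strings contain more than two pipes, the match runs to the last pipe rather than the second one. The sketch below uses base R instead of stringr and assumes a hypothetical fourth field ("extra") appended to the example string to show the difference between the greedy and lazy forms.

```r
# Hypothetical string with a third pipe; "extra" is made up for illustration
x2 <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000|extra"

# Greedy: matches from the first pipe through the LAST pipe
greedy <- regmatches(x2, regexpr("\\|.*\\|", x2, perl = TRUE))

# Lazy (.*?): stops at the first pipe that closes the match
lazy <- regmatches(x2, regexpr("\\|.*?\\|", x2, perl = TRUE))

# Drop the leading and trailing pipe from the lazy match
field <- substr(lazy, 2, nchar(lazy) - 1)

print(greedy)  # "|ENSG00000004059.11|OTTHUMG000|"
print(field)   # "ENSG00000004059.11"
```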
CodePudding user response:
Another option is to get the second item after splitting the string on |.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]
# [1] "ENSG00000004059.11"
Or with tidyverse:
library(tidyverse)
str_split(x, "\\|") %>% map_chr(`[`, 2)
# [1] "ENSG00000004059.11"
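If the data is a whole column of such strings rather than a single one, the splitting approach can be applied to a vector. This is a minimal base R sketch; the second and third entries of ids are made up, and fixed = TRUE is used so the pipe does not need escaping.

```r
ids <- c(
  "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
  "A|B|C",     # made-up entry
  "tx_only"    # made-up entry with no pipe at all
)

# fixed = TRUE treats "|" as a literal character, not as regex alternation
parts <- strsplit(ids, "|", fixed = TRUE)

# Take the second piece of each split; entries with fewer pieces give NA
second <- vapply(parts, function(p) p[2], character(1))

print(second)
```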
CodePudding user response:
Another option is to use lookbehind and lookahead to extract the text that sits between two "|" characters.
The regex means: match one or more characters, as few as possible (.+?), that are preceded by "|" ((?<=\\|)) and followed by "|" ((?=\\|)). The pipes themselves are not part of the match.
library(stringr)
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")
# [1] "ENSG00000004059.11"
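The same lookaround idea should also work in base R when perl = TRUE is set. The sketch below uses [^|]+ in place of the lazy quantifier, and a made-up string "a|b|c|d" to show that, because lookarounds do not consume the pipes, gregexpr() can return every field that has a pipe on both sides.

```r
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

# Lookarounds require perl = TRUE in base R
hit <- regexpr("(?<=\\|)[^|]+(?=\\|)", x, perl = TRUE)
between <- regmatches(x, hit)

# Made-up string: every field enclosed by pipes is returned
y <- "a|b|c|d"
all_between <- regmatches(y, gregexpr("(?<=\\|)[^|]+(?=\\|)", y, perl = TRUE))[[1]]

print(between)      # "ENSG00000004059.11"
print(all_between)  # "b" "c"
```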
