Example of the string in question, called gene_snps:
"ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"
Desired outcome:
markerID
chr9:23143143_A/C
chr9:5322432_G/T
chr9:9840984342_T/C
chr9:5324234:G/T
chr9:324424_T/A
With the ultimate outcome being a table:
CHR POS REF ALT
chr9 23143143 A C
chr9 5322432 G T
chr9 9840984342 T C
chr9 5324234 G T
chr9 324424 T A
I used to have code to split these up when they were only ";" separated:
x <- separate_rows(gene_snps, markerIDs, sep=";")
x <- separate(x, col="markerIDs", into=c("pos", "ref_alt"), sep="_")
x <- separate(x, col="pos", into=c("CHR", "POS"), sep=":")
x <- separate(x, col="ref_alt", into=c("REF", "ALT"), sep="/")
but that no longer works, because the upstream tool that generates the string now adds the "ultra_rare_variant" tag, which is itself "_" separated.
Any help with splitting this string up to get rid of the ultra_rare_variant bit and split each chrx:x_x/x chunk into its own row would be much appreciated!
All the best
CodePudding user response:
We just need to use a regex with a lookahead:
library(stringr)
x <- str_split(gene_snps, '[_;](?=chr)', simplify = TRUE)[-1]  # gene_snps is the string from the question
x
[1] "chr9:23143143_A/C" "chr9:5322432_G/T" "chr9:9840984342_T/C" "chr9:5324234:G/T"
[5] "chr9:324424_T/A"
This matches either a "_" or a ";" immediately preceding "chr" and splits the string there. The lookahead (?=chr) keeps "chr" itself out of the match, so it stays in each piece. With simplify = TRUE, str_split returns a character matrix instead of a list, and indexing with [-1] drops the first element, "ultra_rare_variant", leaving a plain character vector.
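For anyone who would rather not load stringr, the same lookahead split can be sketched in base R. This is only an illustration: it assumes the string is stored in gene_snps as in the question, and the variable name ids is made up for the example. strsplit needs perl = TRUE for the lookahead to be understood.

```r
# The example string from the question
gene_snps <- "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"

# Split on "_" or ";" only when "chr" follows; perl = TRUE enables the lookahead
ids <- strsplit(gene_snps, "[_;](?=chr)", perl = TRUE)[[1]]

# The first piece is the "ultra_rare_variant" tag, so drop it
ids <- ids[-1]
ids
```

strsplit always returns a list, which is why [[1]] is needed before dropping the first element.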
To make it a table, we can split each piece again on ":", "/", or "_" and convert the result to a data frame. Since there are multiple strings and multiple split sites, simplify = TRUE gives us a matrix (each row is a string in x, each column is a piece after splitting), which can be converted into a data.frame:
tbl <- as.data.frame(str_split(x, '[:_/]', simplify = TRUE))
colnames(tbl) <- c('CHR', 'POS', 'REF','ALT') # Set column names
tbl
CHR POS REF ALT
1 chr9 23143143 A C
2 chr9 5322432 G T
3 chr9 9840984342 T C
4 chr9 5324234 G T
5 chr9 324424 T A
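One thing to watch if POS is needed as a number: every column produced by splitting is character. The sketch below rebuilds the same table with base R only (so it runs on its own) and converts POS with as.numeric. That choice is deliberate, since the example position 9840984342 is larger than .Machine$integer.max and as.integer would turn it into NA with a warning. The names ids and fields are just illustrative.

```r
gene_snps <- "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"

# Split into marker IDs, dropping the leading "ultra_rare_variant" tag
ids <- strsplit(gene_snps, "[_;](?=chr)", perl = TRUE)[[1]][-1]

# Split each ID on ":", "_" or "/" and stack the pieces into a matrix
fields <- do.call(rbind, strsplit(ids, "[:_/]"))

tbl <- as.data.frame(fields, stringsAsFactors = FALSE)
colnames(tbl) <- c("CHR", "POS", "REF", "ALT")

# Use double precision: 9840984342 does not fit in a 32-bit integer
tbl$POS <- as.numeric(tbl$POS)
str(tbl)
```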
CodePudding user response:
Use stringr::str_match_all to directly save everything into named capture groups.
x <- "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"
pattern <- "(?<CHR>chr\\d)?[_:/](?<POS>\\d+)?[_:/](?<REF>[A-Z])?[_:/](?<ALT>[A-Z])?"
as.data.frame(stringr::str_match_all(x, pattern)[[1L]][, -1L])
Output
CHR POS REF ALT
1 chr9 23143143 A C
2 chr9 5322432 G T
3 chr9 9840984342 T C
4 chr9 5324234 G T
5 chr9 324424 T A
CodePudding user response:
Another possible solution:
library(tidyverse)
s <- "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"
data.frame(CHR = s %>% str_remove("ultra_rare_variant_")) %>%
  separate_rows(CHR, sep=";|_(?=chr9)") %>%
  separate(CHR, into = c("CHR","POS","REF","ALT"), sep=":|_|/")
#> # A tibble: 5 × 4
#> CHR POS REF ALT
#> <chr> <chr> <chr> <chr>
#> 1 chr9 23143143 A C
#> 2 chr9 5322432 G T
#> 3 chr9 9840984342 T C
#> 4 chr9 5324234 G T
#> 5 chr9 324424 T A
