Creating multiple rows from a complex string in R

Example of string in question called gene_snps:

"ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"

Desired outcome:

markerID
chr9:23143143_A/C
chr9:5322432_G/T
chr9:9840984342_T/C
chr9:5324234:G/T
chr9:324424_T/A

With the ultimate outcome being a table:

CHR POS REF ALT
chr9 23143143 A C
chr9 5322432 G T
chr9 9840984342 T C
chr9 5324234 G T
chr9 324424 T A

I used to have code to split these up when they were only ";" separated:

x <- separate_rows(gene_snps, markerIDs, sep = ";")
# separate() (not separate_rows()) is the tidyr function that takes `into`
x <- separate(x, col = "markerIDs", into = c("pos", "ref_alt"), sep = "_")
x <- separate(x, col = "pos", into = c("CHR", "POS"), sep = ":")
x <- separate(x, col = "ref_alt", into = c("REF", "ALT"), sep = "/")

but that no longer works, because the upstream tool that generates the string now adds the "ultra_rare_variant" tag, which is itself "_" separated.

Any help with splitting this string up to get rid of the ultra_rare_variant bit and put each chrx:x_x/x chunk into its own row would be much appreciated!

All the best

CodePudding user response：

We just need to use a regex with a lookahead:

library(stringr)

# gene_snps is the string from the question
x <- str_split(gene_snps, '[_;](?=chr)', simplify = TRUE)[-1]
x

[1] "chr9:23143143_A/C"   "chr9:5322432_G/T"    "chr9:9840984342_T/C" "chr9:5324234:G/T"   
[5] "chr9:324424_T/A"

This matches either a "_" or a ";" that immediately precedes "chr" and splits the string there. With simplify = TRUE, str_split() returns a character matrix instead of a list (a single row here, since there is one input string), and indexing it with [-1] drops the first element, "ultra_rare_variant", leaving a plain character vector.
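
For anyone who would rather not load stringr, the same lookahead split can be sketched in base R. This assumes the string is stored in a variable called s and that every marker starts with "chr"; the names s, parts and marker_ids are chosen here for illustration only. perl = TRUE is needed because R's default regex engine does not support lookaheads.

```r
s <- "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"

# Split on "_" or ";" only when "chr" follows; strsplit() returns a list,
# so take the first (and only) element.
parts <- strsplit(s, "[_;](?=chr)", perl = TRUE)[[1]]

# Drop the leading "ultra_rare_variant" piece.
marker_ids <- parts[-1]
print(marker_ids)
```

The "_" between the position and the alleles is left alone because it is not followed by "chr".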

To make it a table, split each element again on ":", "/", or "_" and convert the result to a data frame. Because x now holds several strings, simplify = TRUE gives a matrix (one row per string in x, one column per piece after splitting), which can be converted into a data.frame:

tbl <- as.data.frame(str_split(x, '[:_/]', simplify = TRUE))

colnames(tbl) <- c('CHR', 'POS', 'REF','ALT') # Set column names
tbl

   CHR        POS REF ALT
1 chr9   23143143   A   C
2 chr9    5322432   G   T
3 chr9 9840984342   T   C
4 chr9    5324234   G   T
5 chr9     324424   T   A
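
Note that every column in this table is still character. If POS is needed as a number, a small follow-up step can convert it. The sketch below rebuilds the table by hand so it runs on its own; as.numeric() is used rather than as.integer() because the position 9840984342 from the example is larger than R's maximum integer (2147483647) and would become NA.

```r
# Same content as the table above, rebuilt here so the snippet is standalone.
tbl <- data.frame(
  CHR = rep("chr9", 5),
  POS = c("23143143", "5322432", "9840984342", "5324234", "324424"),
  REF = c("A", "G", "T", "G", "T"),
  ALT = c("C", "T", "C", "T", "A")
)

# Convert positions from character to double; as.integer() would overflow
# on 9840984342 and return NA with a warning.
tbl$POS <- as.numeric(tbl$POS)
str(tbl)
```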

CodePudding user response：

Use stringr::str_match_all to directly save everything into named capture groups.

x <- "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"
pattern <- "(?<CHR>chr\\d)?[_:/](?<POS>\\d+)?[_:/](?<REF>[A-Z])?[_:/](?<ALT>[A-Z])?"
as.data.frame(stringr::str_match_all(x, pattern)[[1L]][, -1L])

Output

   CHR        POS REF ALT
1 chr9   23143143   A   C
2 chr9    5322432   G   T
3 chr9 9840984342   T   C
4 chr9    5324234   G   T
5 chr9     324424   T   A
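
A rough base R counterpart of the capture-group idea, in case stringr is not available: utils::strcapture() fills a prototype data frame from the capture groups of a regex. This is only a sketch. It assumes the markers are first split into a vector (done here with strsplit()), that each marker looks like chr<id>:<digits>, then "_" or ":", then <REF>/<ALT>, and the names ids and proto are illustrative.

```r
x <- "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A"

# One marker per element, with the leading tag dropped.
ids <- strsplit(x, "[_;](?=chr)", perl = TRUE)[[1]][-1]

# Prototype: column names and types for the four capture groups.
# POS is numeric (double) so large positions do not overflow an integer.
proto <- data.frame(CHR = character(), POS = numeric(),
                    REF = character(), ALT = character())

# Four unnamed groups, matched in order against the prototype columns.
tbl <- utils::strcapture("^(chr[^:]+):([0-9]+)[_:]([A-Z])/([A-Z])$", ids, proto)
print(tbl)
```

A marker that does not match the pattern ends up as a row of NA values, which makes malformed entries easy to spot.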

CodePudding user response：

Another possible solution:

library(tidyverse)

s <- "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T_chr9:9840984342_T/C;chr9:5324234:G/T;chr9:324424_T/A" 

data.frame(CHR = s %>% str_remove("ultra_rare_variant_")) %>% 
  separate_rows(CHR, sep=";|_(?=chr9)") %>% 
  separate(CHR, into = c("CHR","POS","REF","ALT"), sep=":|_|/")

#> # A tibble: 5 × 4
#>   CHR   POS        REF   ALT  
#>   <chr> <chr>      <chr> <chr>
#> 1 chr9  23143143   A     C    
#> 2 chr9  5322432    G     T    
#> 3 chr9  9840984342 T     C    
#> 4 chr9  5324234    G     T    
#> 5 chr9  324424     T     A
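
The question's original code worked on a data frame column rather than a single string, so here is a sketch of the same pipeline applied to several rows at once. The gene column and its values are made up for illustration; only the markerIDs column name comes from the question. It also assumes the tag, when present, is always the literal prefix "ultra_rare_variant_".

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical input: one row per gene, all markers packed into one string.
gene_snps <- tibble(
  gene = c("geneA", "geneB"),
  markerIDs = c(
    "ultra_rare_variant_chr9:23143143_A/C_chr9:5322432_G/T;chr9:324424_T/A",
    "chr9:5324234:G/T;chr9:9840984342_T/C"
  )
)

out <- gene_snps %>%
  # Remove the tag only where it appears at the start of the string.
  mutate(markerIDs = str_remove(markerIDs, "^ultra_rare_variant_")) %>%
  # One marker per row: split on ";" or on "_" followed by "chr".
  separate_rows(markerIDs, sep = ";|_(?=chr)") %>%
  separate(markerIDs, into = c("CHR", "POS", "REF", "ALT"), sep = "[:_/]") %>%
  # Explicit conversion; convert = TRUE is avoided because an allele column
  # containing only "T" would be read as logical TRUE.
  mutate(POS = as.numeric(POS))

print(out)
```

The other columns (here gene) are repeated for every marker, so each output row can still be traced back to its source row.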