I'm working with ~1800 whole genome sequences of SARS-CoV-2 and I want to keep only the "EPI_ISL_NC045512" pattern, which is between two "|". This would be my string:
>New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512
actcacgcagtataattaataactaattactgtcgttgacaggacacgagtaactcgtctatcttctgcaggctgcttacggtttcgtccgtg
I also need to keep the ">". I tried (>)(.+)([EPI.+])(.+) but it didn't work.
CodePudding user response:
A simple one could be this: \|(EPI([A-Z0-9_]+))\|
Assuming your ID contains only A-Z, 0-9 and _, the result is in group 1 (the outer pair of parentheses). The | characters must be escaped, because an unescaped | means alternation in a regex.
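As a rough illustration, here is how that pattern might be applied in Python. The question does not say which tool is being used, so Python is an assumption here. The inner group is dropped because only group 1 is needed, and the sample header is the one from the question.

```python
import re

# Sample header copied from the question.
header = ">New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512"

# The pipes are escaped so they match a literal "|" instead of acting as alternation.
pattern = re.compile(r"\|(EPI[A-Z0-9_]+)\|")

match = pattern.search(header)
# Group 1 holds the ID without the surrounding pipes; None if the header has no such field.
accession = match.group(1) if match else None
print(accession)  # EPI_ISL_NC045512
```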
CodePudding user response:
You could use two capture groups if you want to keep > in one group and EPI_ISL_NC045512 in another:
(>)[^>]*\|(EPI[^|]*)\|
(>)          Capture > in group 1
[^>]*\|      Match zero or more characters other than >, then match |
(EPI[^|]*)   Capture EPI followed by any characters other than | in group 2
\|           Match |
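A minimal sketch of using the two groups to shorten every header, assuming Python and assuming the sequences are handled line by line as in a FASTA file. The helper name shorten_header is hypothetical; lines that do not match, such as sequence lines, are returned unchanged.

```python
import re

# Group 1 is the leading ">", group 2 is the EPI... field between two pipes.
HEADER_RE = re.compile(r"(>)[^>]*\|(EPI[^|]*)\|")

def shorten_header(line):
    """Return '>' plus the EPI ID for a matching header; leave other lines as they are."""
    m = HEADER_RE.match(line)
    if m is None:
        return line
    return m.group(1) + m.group(2)

# Header and sequence lines copied from the question.
lines = [
    ">New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512",
    "actcacgcagtataattaataactaattactgtcgttgacaggacacgagtaactcgtctatcttctgcaggctgcttacggtttcgtccgtg",
]
shortened = [shorten_header(line) for line in lines]
print(shortened[0])  # >EPI_ISL_NC045512
```

When reading from a real file, strip the trailing newline from each line before calling the helper and add it back when writing, since the helper returns only the two captured groups for header lines.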
