I have two example strings that I would like to split on ", " if a comma is present, and on " " otherwise.
x = ">Keratyna 5, egzon 2, Homo sapiens"
y = ">101m_A mol:protein length:154 MYOGLOBIN"
The split should be performed just once to recover two pieces of information:
id, description = re.split(pattern, string, maxsplit=1)
For ">Keratyna 5, egzon 2, Homo sapiens" -> [">Keratyna 5", "egzon 2, Homo sapiens"]
For ">101m_A mol:protein length:154 MYOGLOBIN" -> [">101m_A", "mol:protein length:154 MYOGLOBIN"]
I came up with the following patterns:
",\\s+|\\s+", ",\\s+|^,\\s+", "[,]\\s+|[^,]\\s+",
but none of these work.
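As a minimal sketch of what goes wrong with the first of these (assuming the patterns are meant to use `\s+`): the regex engine takes the leftmost match in the string, so the space before "5" is found before the later ", " is ever considered.

```python
import re

x = ">Keratyna 5, egzon 2, Homo sapiens"

# The alternation does not prefer ", " over " "; whichever alternative
# matches at the earliest position in the string wins.
parts = re.split(r",\s+|\s+", x, maxsplit=1)
print(parts)
```

The second string has no comma, so the same pattern happens to give the wanted result there; only the comma-containing string is split in the wrong place.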
The solution I made is using an exception:
try:
    id, description = re.split(r",\s+", description, maxsplit=1)
except ValueError:
    id, description = re.split(r"\s+", description, maxsplit=1)
but honestly I hate this workaround. I haven't found any suitable regex pattern yet. How should I do it?
Answer:
You can use
^((?=.*,)[^,]+|\S+)[\s,]+(.*)
Details:
^ - start of string
((?=.*,)[^,]+|\S+) - Group 1: if there is a , after any zero or more chars other than line break chars as many as possible, then match one or more chars other than , ; otherwise match one or more non-whitespace chars
[\s,]+ - one or more commas/whitespaces
(.*) - Group 2: zero or more chars other than line break chars as many as possible
See the Python demo:
import re

pattern = re.compile(r'^((?=.*,)[^,]+|\S+)[\s,]+(.*)')
texts = [">Keratyna 5, egzon 2, Homo sapiens", ">101m_A mol:protein length:154 MYOGLOBIN"]
for text in texts:
    m = pattern.search(text)
    if m:  # search() returns None when the text cannot be split
        id, description = m.groups()
        print(f"ID: '{id}', DESCRIPTION: '{description}'")
Output:
ID: '>Keratyna 5', DESCRIPTION: 'egzon 2, Homo sapiens'
ID: '>101m_A', DESCRIPTION: 'mol:protein length:154 MYOGLOBIN'
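The `if m:` guard matters because the pattern requires at least one separator character. A minimal sketch of wrapping this in a function follows; `split_header` is a hypothetical helper name, and returning an empty description for unsplittable input is an assumption about the desired behavior, not something stated in the question.

```python
import re

# Pattern from this answer, written with + quantifiers.
HEADER = re.compile(r'^((?=.*,)[^,]+|\S+)[\s,]+(.*)')

def split_header(text):
    """Return (id, description); description is '' when nothing can be split off."""
    m = HEADER.search(text)
    if m is None:
        # No comma and no whitespace: treat the whole text as the id (assumed policy).
        return text, ""
    return m.group(1), m.group(2)

print(split_header(">Keratyna 5, egzon 2, Homo sapiens"))
print(split_header(">101m_A mol:protein length:154 MYOGLOBIN"))
print(split_header(">P12345"))
```

The last call uses a made-up header with no separator at all to show the fallback branch.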
Answer:
You could split either on the first occurrence of ", " or on a space that has no ", " anywhere to its right, using an alternation:
, | (?!.*?, )
The pattern matches:
,  - match a comma followed by a space
| - or
 (?!.*?, ) - match a space, then a negative lookahead asserts that there is no comma followed by a space anywhere to the right
Example
import re

strings = [
    ">Keratyna 5, egzon 2, Homo sapiens",
    ">101m_A mol:protein length:154 MYOGLOBIN"
]
for s in strings:
    print(re.split(r", | (?!.*?, )", s, maxsplit=1))
Output
['>Keratyna 5', 'egzon 2, Homo sapiens']
['>101m_A', 'mol:protein length:154 MYOGLOBIN']
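To see what the lookahead and `maxsplit=1` each contribute, here is a small sketch comparing the same pattern with and without the split limit. The variable names are only for illustration.

```python
import re

pattern = r", | (?!.*?, )"
x = ">Keratyna 5, egzon 2, Homo sapiens"

# With maxsplit=1 only the first ", " is used as a split point.
first_only = re.split(pattern, x, maxsplit=1)

# Without maxsplit, every ", " splits, and so does any space that comes
# after the last ", ". Spaces that still have a ", " to their right are
# skipped by the negative lookahead.
everything = re.split(pattern, x)

print(first_only)
print(everything)
```

The space between "Keratyna" and "5" is never a split point in either call, because a ", " still follows it.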
