I am trying to split a string on commas and the word "or", even when those separators are repeated. For example, re.split() on the string "one , , or or, two" should return "one" and "two".
So far this is what I have (Using Python 3.10):
import re
pattern = re.compile(r"(?:\s*,\s*or\s*|\s*,\s*|\s+or\s+)+", flags=re.I)
string = "one,two or three , four or five or , or six , oR , seven, ,,or, ,, eight or qwertyor orqwerty,"
result = re.split(pattern, string)
print(result)
which returns:
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'qwertyor orqwerty', '']
My issue is that when "or" appears several times in a row, my pattern only recognizes every other "or". For example:
string = "one or or two"
>>> ['one', 'or two']
string = "one or or or two"
>>> ['one', 'or', 'two']
Notice that in the first example the second element still contains "or", and in the second example "or" is an element by itself.
Is there a way to get around this? Also, if there is a better way to split these strings, I would greatly appreciate hearing about it.
CodePudding user response:
You can use
import re
text = "one,two or three , four or five or , or six , oR , seven, ,,or, ,, eight or qwertyor orqwerty,"
print(re.split(r'(?:\s*(?:,|\bor\b))+\s*', text.rstrip().rstrip(',')))
# => ['one', 'two', 'three', 'four', 'five', 'six', 'oR', 'seven', 'eight', 'qwertyor orqwerty']
Details:
(?:\s*(?:,|\bor\b))+ - one or more repetitions of:
    \s* - zero or more whitespaces
    (?:,|\bor\b) - either a comma or the whole word "or"
\s* - zero or more whitespaces.
Note the use of non-capturing groups. This is crucial because re.split also returns the text of any capturing groups in its result.
Also, note the text.rstrip().rstrip(',') call: it removes the trailing comma so that there is no trailing empty item in the result.
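To connect this back to the question: the original pattern skips every other "or" because \s+or\s+ consumes the whitespace after each "or", so the next "or" has no leading whitespace left to match. Below is a minimal sketch of the approach on the question's failing inputs. It assumes case-insensitive matching is wanted (the question passed re.I), and split_items is a hypothetical helper name used only for illustration. It filters out empty strings instead of stripping the trailing comma, which is one possible alternative rather than the answer's own method.

```python
import re

# Separator pattern from the answer, with re.I added (an assumption carried
# over from the question) so that "oR" is also treated as a separator.
SEPARATOR = re.compile(r"(?:\s*(?:,|\bor\b))+\s*", flags=re.I)


def split_items(text):
    """Hypothetical helper: split on runs of commas and whole-word 'or'."""
    # Dropping empty strings covers leading and trailing separators alike.
    return [item for item in SEPARATOR.split(text.strip()) if item]


# The question's pattern needs whitespace on BOTH sides of each "or".
original = re.compile(r"(?:\s*,\s*or\s*|\s*,\s*|\s+or\s+)+", flags=re.I)
print(original.split("one or or two"))    # ['one', 'or two']

print(split_items("one or or two"))       # ['one', 'two']
print(split_items("one or or or two"))    # ['one', 'two']
print(split_items("one , , or or, two"))  # ['one', 'two']
print(split_items("qwertyor orqwerty,"))  # ['qwertyor orqwerty']
```

Because \bor\b only needs word boundaries rather than surrounding whitespace, words that merely contain "or" are left untouched.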
CodePudding user response:
Does Python support the word boundary assertion \b? If so, you could probably simplify the regular expression to something along the following lines:
\s*((,|\bor\b)\s*)+
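Python's re module does support \b. The sketch below compares two forms of this idea with a trailing + quantifier: one written with capturing groups and one with non-capturing groups. The sample string and variable names are illustrative assumptions. The point is that re.split adds captured group text to its result, so the non-capturing form is probably the one you want.

```python
import re

text = "one or or two"

# Capturing groups: re.split also returns the captured separator text,
# so extra entries show up between the real items.
with_groups = re.split(r"\s*((,|\bor\b)\s*)+", text)
print(with_groups)

# Non-capturing groups: only the items between separators are returned.
without_groups = re.split(r"\s*(?:(?:,|\bor\b)\s*)+", text)
print(without_groups)  # ['one', 'two']
```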
