I have a question about the matching order of regex alternatives joined by the | operator. I have this regex: " ?\p{L}+|\s+". For a string like inputs = "     s" (five spaces followed by s), re.findall() splits it into "     " and "s". My question is: how is the order determined? " ?\p{L}+" should give " s", so why is the space missing from that match in the final result? To clarify, I am using the Python regex module.
To reproduce:
import regex as re  # third-party "regex" module, which supports \p{L}
pat = re.compile(r" ?\p{L}+|\s+")
inputs = "     s"  # five spaces, then "s"
print(re.findall(pat, inputs))  # ['     ', 's']
Many thanks for your help!
CodePudding user response:
Here is how the regex " ?\p{L}+|\s+" is matched against the input "     s":
- The regex engine scans the input from left to right, and at each position it tries the alternatives in the order they are written.
- At the start of the input it first tries the first alternative, " ?\p{L}+". The optional space can match, but the next character is another space rather than a letter, so this alternative fails at position 0.
- Next it tries the second alternative, "\s+", which succeeds. It is greedy, so the first match is "     " (all five spaces).
- All 5 spaces have now been consumed by this match, and the pointer moves to the letter s.
- The engine then tries the alternatives again, starting at s.
- This time " ?\p{L}+" succeeds: the optional space matches nothing and \p{L}+ matches s, so the second match is "s".
- The regex engine stops at this point because it has reached the end of the input.
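The steps above can be checked with a small sketch that prints where each match starts and ends. To keep it runnable without installing anything, it uses the standard-library re module and swaps \p{L} for [^\W\d_], which is only an approximation of "any letter"; the names LETTER and trace are made up for this illustration.

```python
import re

# Assumption: [^\W\d_] ("word character that is not a digit or underscore")
# stands in for \p{L}, which the stdlib re module does not provide.
LETTER = r"[^\W\d_]"
pat = re.compile(rf" ?{LETTER}+|\s+")


def trace(pattern, text):
    """Return (start, end, matched_text) for each match, in left-to-right order."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


inputs = "     s"  # five spaces, then "s"
matches = trace(pat, inputs)
for start, end, text in matches:
    print(start, end, repr(text))
# 0 5 '     '
# 5 6 's'
```

With only one leading space, trace(pat, " s") gives a single match " s", because the first alternative already succeeds at position 0 and "\s+" is never tried there.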
CodePudding user response:
You can add a negative lookahead to the second alternative so that "\s+" does not consume the last space, which " ?\p{L}+" can then match together with the letter:
pat = re.compile(r" ?\p{L}+|\s+(?!\p{L})")
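A quick before/after comparison, as a sketch only: it again uses the standard-library re module with [^\W\d_] as an assumed stand-in for \p{L}, and the names plain and fixed are just for illustration.

```python
import re

LETTER = r"[^\W\d_]"  # assumed stand-in for \p{L}

plain = re.compile(rf" ?{LETTER}+|\s+")
fixed = re.compile(rf" ?{LETTER}+|\s+(?!{LETTER})")

inputs = "     s"  # five spaces, then "s"
before = plain.findall(inputs)
after = fixed.findall(inputs)
print(before)  # ['     ', 's']
print(after)   # ['    ', ' s']
```

The lookahead makes "\s+" back off by one character when a letter follows, so the run of spaces shrinks to four and the last space goes to the first alternative. One thing to watch: if the only whitespace before a letter is a single non-space character such as a tab, "\s+" cannot back off any further and the tab is not matched at all, so fixed.findall("\ts") returns ['s'].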
