Python regex optional capture group question

Time:01-08

I am having trouble matching data like the following:

LABEL: TEXT1 TEXT2

"LABEL" and "TEXT2" are optional, and the fields are separated by "\s" (a blank or an indent).

I would like to have 3 capture groups that give the following output:

group1:LABEL(or None)
group2:TEXT1
group3:TEXT2(or None)

So I've written a regex like this:

\s*(\S*(?::))?\s*(\S*)\s*(.*$)

And the result is:

group1:LABEL:
group2:TEXT1
group3:TEXT2(or Null)

Why does group1 contain the ":"?

Also, when TEXT2 doesn't exist, group3 is null (an empty string) instead of None.

CodePudding user response:

Consider this approach:

import re

inp = ["LABEL: TEXT1 TEXT2", "LABEL: TEXT1", "TEXT1 TEXT2", "TEXT1"]
# findall returns one tuple per match; groups that did not match come back as ''
matches = [re.findall(r'(?:(\w+):)?\s*(\w+)(?:\s+(\w+))?', x) for x in inp]
print(matches)

# [[('LABEL', 'TEXT1', 'TEXT2')],
#  [('LABEL', 'TEXT1', '')],
#  [('', 'TEXT1', 'TEXT2')],
#  [('', 'TEXT1', '')]]

This regex pattern says to match:

(?:(\w+):)?     an optional leading word term followed by a colon
\s*             optional whitespace
(\w+)           mandatory middle word term
(?:\s+(\w+))?   optional whitespace followed by a final word term
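
Note that `re.findall` reports groups that did not take part in the match as empty strings. If `None` is wanted instead, as in the question, one option is to use `re.match` and read `groups()`. The following is a minimal sketch, assuming the intended pattern is `(?:(\w+):)?\s*(\w+)(?:\s+(\w+))?`; the names `pattern` and `results` are only for illustration.

```python
import re

# Assumed pattern: optional "word:" label, a mandatory word, an optional final word.
pattern = re.compile(r'(?:(\w+):)?\s*(\w+)(?:\s+(\w+))?')

inp = ["LABEL: TEXT1 TEXT2", "LABEL: TEXT1", "TEXT1 TEXT2", "TEXT1"]

results = []
for text in inp:
    m = pattern.match(text)
    # groups() yields None for every group that did not participate in the match
    results.append(m.groups() if m else None)

for row in results:
    print(row)
```

With a match object, the optional groups that are absent should come back as `None` rather than `''`.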

CodePudding user response:

You can use

^(?:(.*?):)?\s*(\S+)(?:\s+(\S.*))?$

See the regex demo. Details:

  • ^ - start of string
  • (?:(.*?):)? - an optional sequence of any zero or more chars other than line break chars as few as possible captured into Group 1 and then a : char
  • \s* - zero or more whitespaces
  • (\S+) - Group 2: one or more non-whitespace chars
  • (?:\s+(\S.*))? - an optional sequence of one or more whitespaces and then Group 3 capturing a non-whitespace char and then any zero or more chars other than line break chars as many as possible
  • $ - end of string
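
As for why group 1 in the question contains the colon: `(?::)` is non-capturing, but it sits inside the capturing parentheses of group 1, so whatever it matches is still part of group 1's text. Moving the colon outside the capturing group, as in `(?:(...):)?`, keeps it out. A small sketch comparing the two placements (the variable names are illustrative only):

```python
import re

text = "LABEL: TEXT1 TEXT2"

# Question's pattern: the colon is matched inside capturing group 1.
inside = re.match(r'\s*(\S*(?::))?\s*(\S*)\s*(.*$)', text)

# Same pattern with the colon moved outside the capturing group.
outside = re.match(r'\s*(?:(\S*):)?\s*(\S*)\s*(.*$)', text)

print(inside.group(1))   # LABEL:
print(outside.group(1))  # LABEL
```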

See the Python demo:

import re
texts = ['L A B12 E4L-0: TEXT1 TEXT 2 HEL-L!O!!!', 'TEXT1 TEXT 2 HEL-L!O!!!', 'L A B12 E4L-0: TEXT1', 'TEXT']
rx = re.compile(r'^(?:(.*?):)?\s*(\S+)(?:\s+(\S.*))?$')
for text in texts:
    m = rx.search(text)
    if m:
        print(m.groups())

Output:

('L A B12 E4L-0', 'TEXT1', 'TEXT 2 HEL-L!O!!!')
(None, 'TEXT1', 'TEXT 2 HEL-L!O!!!')
('L A B12 E4L-0', 'TEXT1', None)
(None, 'TEXT', None)
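
The last two rows also show why group 3 is None here but empty in the question's pattern: `(.*$)` is not optional and always participates, matching an empty string when nothing is left, whereas a group wrapped in an optional `(?:...)?` is skipped entirely. A short sketch of that difference, using an input without TEXT2 (names are illustrative):

```python
import re

text = "LABEL: TEXT1"

# Question's pattern: group 3 always matches, possibly the empty string.
always = re.match(r'\s*(\S*(?::))?\s*(\S*)\s*(.*$)', text)

# This answer's pattern: group 3 lives inside an optional non-capturing group.
optional = re.match(r'^(?:(.*?):)?\s*(\S+)(?:\s+(\S.*))?$', text)

print(repr(always.group(3)))    # ''
print(repr(optional.group(3)))  # None
```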