I have the following matching problem. The data looks like this:
LABEL: TEXT1 TEXT2
"LABEL" and "TEXT2" are optional, and the fields are separated by "\s" (blank or indent).
I would like three capture groups that give this output:
group1: LABEL (or None)
group2: TEXT1
group3: TEXT2 (or None)
So I've written a regex like this:
\s*(\S*(?::))?\s*(\S*)\s*(.*$)
And the result is:
group1:LABEL:
group2:TEXT1
group3:TEXT2(or Null)
Why does group1 contain ":"?
Also, when TEXT2 doesn't exist, group3 is null instead of None.
CodePudding user response:
Consider this approach:
import re

inp = ["LABEL: TEXT1 TEXT2", "LABEL: TEXT1", "TEXT1 TEXT2", "TEXT1"]
# findall returns one tuple of groups per match; unmatched optional groups become ''
matches = [re.findall(r'(?:(\w+):)?\s*(\w+)(?:\s+(\w+))?', x) for x in inp]
print(matches)
# [[('LABEL', 'TEXT1', 'TEXT2')],
#  [('LABEL', 'TEXT1', '')],
#  [('', 'TEXT1', 'TEXT2')],
#  [('', 'TEXT1', '')]]
This regex pattern says to match:
(?:(\w+):)?      an optional leading word term followed by a colon
\s*              optional whitespace
(\w+)            mandatory middle word term
(?:\s+(\w+))?    optional whitespace followed by a final word term
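Note that re.findall fills groups that did not participate with empty strings, whereas a match object's groups() returns None for them. If None is what you want, here is a minimal sketch using the same pattern; the helper name split_line and the use of re.fullmatch (to require that the whole line matches) are my own assumptions, not part of the approach above:

```python
import re

# Same pattern as above, compiled once.
PATTERN = re.compile(r'(?:(\w+):)?\s*(\w+)(?:\s+(\w+))?')

def split_line(line):
    """Return (label, text1, text2), with None for missing optional parts.

    Hypothetical helper: returns None if the line does not match at all.
    """
    m = PATTERN.fullmatch(line.strip())
    return m.groups() if m else None

for line in ["LABEL: TEXT1 TEXT2", "LABEL: TEXT1", "TEXT1 TEXT2", "TEXT1"]:
    print(split_line(line))
```

With fullmatch, a line containing anything beyond the three word terms is rejected rather than partially matched, which may or may not be what you need.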
CodePudding user response:
You can use
^(?:(.*?):)?\s*(\S+)(?:\s+(\S.*))?$
See the regex demo. Details:
^ - start of string
(?:(.*?):)? - an optional sequence of any zero or more chars other than line break chars, as few as possible, captured into Group 1, followed by a : char
\s* - zero or more whitespaces
(\S+) - Group 2: one or more non-whitespace chars
(?:\s+(\S.*))? - an optional sequence of one or more whitespaces followed by Group 3, which captures a non-whitespace char and then any zero or more chars other than line break chars, as many as possible
$ - end of string
See the Python demo:
import re

texts = ['L A B12 E4L-0: TEXT1 TEXT 2 HEL-L!O!!!', 'TEXT1 TEXT 2 HEL-L!O!!!', 'L A B12 E4L-0: TEXT1', 'TEXT']
rx = re.compile(r'^(?:(.*?):)?\s*(\S+)(?:\s+(\S.*))?$')
for text in texts:
    m = rx.search(text)
    if m:
        # groups() yields None for optional groups that did not participate
        print(m.groups())
Output:
('L A B12 E4L-0', 'TEXT1', 'TEXT 2 HEL-L!O!!!')
(None, 'TEXT1', 'TEXT 2 HEL-L!O!!!')
('L A B12 E4L-0', 'TEXT1', None)
(None, 'TEXT', None)
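As for why group 1 contained ":" in the question's pattern: (?::) is a non-capturing group, but it sits inside the capturing parentheses, so the colon it matches is still part of group 1. Group 3 is never None there either, because (.*$) is not optional and simply matches an empty string. A small sketch comparing the question's pattern with a variant that moves the colon outside the capture (the variable names are mine, for illustration only):

```python
import re

# The pattern from the question: the colon is matched inside group 1.
original = re.compile(r'\s*(\S*(?::))?\s*(\S*)\s*(.*$)')
# Variant with the colon moved outside the capturing group.
colon_outside = re.compile(r'\s*(?:(\S*):)?\s*(\S*)\s*(.*$)')

print(original.match("LABEL: TEXT1 TEXT2").groups())
# ('LABEL:', 'TEXT1', 'TEXT2')
print(colon_outside.match("LABEL: TEXT1 TEXT2").groups())
# ('LABEL', 'TEXT1', 'TEXT2')
print(colon_outside.match("LABEL: TEXT1").groups())
# ('LABEL', 'TEXT1', '')  <- group 3 is '' because (.*$) always participates
```

To get None for a missing TEXT2, the third group has to be inside an optional group, as in the pattern above.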
