I'm trying to parse a comma-separated list with multiple capture groups in each element via regex.
Sample Text
col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37
I've tried using various variants of this regex
(.*?)\s?=\s?(.*?)\s?,?
But it never gives me what I want, or, if it gets close, it can't cope with there being just one element, or vice versa.
What I'm expecting is a list of Matches with 3 groups
Match1 group 0 the whole match
Match1 group 1 col1
Match1 group 2 'Test String'
Match2 group 0 the whole match
Match2 group 1 col2
Match2 group 2 'Next Test String'
Match3 group 0 the whole match
Match3 group 1 col3
Match3 group 2 'Last Text String'
Match4 group 0 the whole match
Match4 group 1 col4
Match4 group 2 37
(Note I'm only interested in groups 1 & 2)
I'm deliberately keeping this language-agnostic, as I can't get it to work in online regex debuggers. However, my target language is Python 3.
Thank you in advance, and I hope I've made myself clear.
CodePudding user response:
The (.*?)\s?=\s?(.*?)\s?,? regex has only one obligatory pattern, =. The (.*?) at the start is expanded up to the leftmost =, so Group 1 captures any text before that = (less one optional whitespace, which the first \s? consumes). None of the subpatterns after the = has to match: if there is a whitespace, it is matched with \s?; if there is a comma, it is also matched and consumed with ,?; and the second (.*?) is simply skipped, because it is lazy and nothing required follows it, so Group 2 is always empty.
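To see that behaviour concretely, here is a small sketch that runs the regex from the question over the sample text with Python's re module. The variable names (original, rows) are only illustrative.

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
original = r"(.*?)\s?=\s?(.*?)\s?,?"  # the regex from the question

# Collect (whole match, Group 1, Group 2) for every match
rows = [(m.group(0), m.group(1), m.group(2)) for m in re.finditer(original, text)]
for whole, key, value in rows:
    print(repr(whole), repr(key), repr(value))
```

Group 2 comes back empty in every match, and from the second match on Group 1 swallows the previous value together with the next key, which is why the result never looks like the expected list.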
If you want to get the second capturing group with single quotes included, you can use
(?:,|^)\s*([^\s=]+)\s*=\s*('[^']*'|\S+)
It matches:
(?:,|^) - a non-capturing group matching a , or start of string
\s* - zero or more whitespaces
([^\s=]+) - Group 1: one or more chars other than whitespace and =
\s*=\s* - a = char enclosed with zero or more whitespaces
('[^']*'|\S+) - Group 2: either ', zero or more non-'s, and a ', or one or more non-whitespaces.
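As a quick check of the breakdown above, this sketch applies that pattern to the sample text with re.finditer and collects Groups 1 and 2 of each match. The name pairs is only illustrative.

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
# Same pattern as described above; Group 2 keeps the single quotes
pattern = re.compile(r"(?:,|^)\s*([^\s=]+)\s*=\s*('[^']*'|\S+)")

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
for key, value in pairs:
    print(key, value)
```

Each match yields the key in Group 1 and the value in Group 2, with quoted values still wrapped in their single quotes.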
If you want to exclude single quotes you can post-process the matches, or use an extra capturing group in '([^']*)', and then check if the group matched or not:
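import re
text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
# Group 1: key, Group 2: quoted value without the quotes, Group 3: unquoted value
pattern = r"([^,\s=]+)\s*=\s*(?:'([^']*)'|(\S+))"
matches = re.findall(pattern, text)
# Only one of Groups 2 and 3 takes part in a match; findall returns '' for the other
print( {key: unquoted or quoted for key, quoted, unquoted in matches} )
# => {'col1': 'Test String', 'col2': 'Next Test String', 'col3': 'Last Text String', 'col4': '37'}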
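Since the question mentions trouble with input that has just one element, here is a minimal sketch that wraps the same idea in a helper and tries it on a single pair. The names PAIR and parse_pairs are hypothetical, not part of any library, and values are left as strings.

```python
import re

# Group 1: key, Group 2: quoted value (quotes stripped), Group 3: unquoted value
PAIR = re.compile(r"([^,\s=]+)\s*=\s*(?:'([^']*)'|(\S+))")

def parse_pairs(text):
    """Return a dict of key -> value (as strings) for every key=value pair found."""
    result = {}
    for m in PAIR.finditer(text):
        # With finditer, the group that did not take part in the match is None
        value = m.group(2) if m.group(2) is not None else m.group(3)
        result[m.group(1)] = value
    return result

print(parse_pairs("col1 = 'Test String'"))
print(parse_pairs("col1 = 'Test String' , col4=37"))
```

Checking for None rather than relying on truthiness keeps an empty quoted value such as col1='' as an empty string.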
If you want to do that with a pure regex, you can use a branch reset group:
import regex # pip install regex
text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
print( dict(regex.findall(r"([^,\s=]+)\s*=\s*(?|'([^']*)'|(\S+))", text)) )
