Capture all occurrences of a substring after specific text with regex in Python

I have a long document in which the line I am interested in starts with "Categories : ". I want to find all the comma-separated entries that follow "Categories : ". Here's an example line:

Categories : Turbo Prop , Very Light , Light , Mid Size

I want to find the start index and end index of each of Turbo Prop, Very Light, Light, and Mid Size.

I am using the following code:

regex_pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"

matched_text = regex.search(regex_pattern, doc_tex)

But matched_text.groups() only gives Mid Size. In short, I want to find all occurrences of group 1 after Categories.

CodePudding user response：

Do it in two steps. First split the line using :, then split the second part using ,.

category_string = line.split(':', 1)[1]  # everything after the first ':'
categories = [c.strip() for c in category_string.split(',')]  # drop the spaces around each entry
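
If you also need the start and end index of each category, as the question asks, the same split idea can be extended with the standard re module. This is only a sketch: category_spans and doc_text are names made up for illustration, and it assumes the line of interest begins with exactly "Categories : " as in the example.

```python
import re

def category_spans(text):
    """Return (category, start, end) tuples for the first 'Categories :' line in text."""
    # Find the line and capture everything after the "Categories : " prefix.
    header = re.search(r"^Categories : (.*)$", text, flags=re.MULTILINE)
    if header is None:
        return []
    offset = header.start(1)  # index in text where the list of categories begins
    spans = []
    # Each run of non-comma characters is one entry, possibly padded with spaces.
    for item in re.finditer(r"[^,]+", header.group(1)):
        raw = item.group(0)
        name = raw.strip()
        if not name:
            continue  # skip empty entries such as ", ,"
        leading = len(raw) - len(raw.lstrip())
        start = offset + item.start() + leading
        spans.append((name, start, start + len(name)))
    return spans

# Hypothetical sample document
doc_text = "Some intro\nCategories : Turbo Prop , Very Light , Light , Mid Size\nMore text"
for name, start, end in category_spans(doc_text):
    print(name, start, end)
```

The indices are relative to the whole document, so doc_text[start:end] gives back each category name.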

CodePudding user response：

It looks like the comments answered the OP's question, but for completeness I thought I'd post the answer they discuss. Python's re module keeps only the last match of a repeated capture group, not every instance; see issue 7132. The regex package, however, adds methods to handle repeated capture groups, including:

captures - Returns a list of the strings matched in a group or groups.
starts - Returns a list of the start positions.
ends - Returns a list of the end positions.
spans - Returns a list of the spans. Compare with matchobject.span([group]).

Hence, using the regex package with the matchobject.starts and matchobject.ends methods should work.
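
Something along these lines should do it. Treat it as a sketch: it assumes the third-party regex package is installed (pip install regex), the pattern is my own rewrite rather than the OP's and only allows letters and single spaces inside an entry, and doc_text is a made-up sample.

```python
import regex  # third-party package, not the standard-library re

# Hypothetical sample document
doc_text = "Some intro\nCategories : Turbo Prop , Very Light , Light , Mid Size\nMore text"

# Group 1 is repeated once per comma-separated entry on the line.
pattern = r"Categories : (?: *([A-Za-z]+(?: [A-Za-z]+)*) *,?)+"

match = regex.search(pattern, doc_text)
if match is not None:
    names = match.captures(1)   # every string captured by group 1
    starts = match.starts(1)    # start index of each capture
    ends = match.ends(1)        # end index of each capture
    for name, start, end in zip(names, starts, ends):
        print(name, start, end)
```

Because the pattern keeps the surrounding spaces and commas outside group 1, the reported positions cover just the category names.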