I have a long document in which the line I am interested in starts with Categories : . I want to find all the comma-separated items that follow Categories : on that line.
Here's an example line:
Categories : Turbo Prop , Very Light , Light , Mid Size
I want to find the start index and end index of each of Turbo Prop, Very Light, Light, and Mid Size.
I am using the following code:
import regex
regex_pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"
matched_text = regex.search(regex_pattern, doc_tex)
But matched_text.groups() only returns Mid Size. In short, I want to find all occurrences of group 1 after Categories.
CodePudding user response:
Do it in two steps. First split the line on :, then split the second part on ,.
category_string = line.split(':', 1)[1]  # text after the first colon
categories = [c.strip() for c in category_string.split(',')]  # drop the padding around each name
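Splitting gives the category names but not their positions, and the question asks for start and end indices. One way to get both with only the standard library is sketched below. The function name find_category_spans and the two patterns are illustrative assumptions, not part of the original answer; the sketch also assumes the line of interest begins exactly with "Categories : ".

```python
import re


def find_category_spans(doc_text):
    """Return (name, start, end) for each category; indices refer to doc_text."""
    # Locate the first line that starts with "Categories : " and capture the rest of it.
    line_match = re.search(r"^Categories : (.*)$", doc_text, flags=re.MULTILINE)
    if line_match is None:
        return []
    offset = line_match.start(1)  # where the category list begins in doc_text
    spans = []
    # Each item runs from a non-space, non-comma character to the last such
    # character before the next comma, so surrounding padding is excluded.
    for item in re.finditer(r"[^\s,](?:[^,\n]*[^\s,])?", line_match.group(1)):
        spans.append((item.group(), offset + item.start(), offset + item.end()))
    return spans
```

For the example line on its own, this should return ('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44) and ('Mid Size', 47, 55), where each end index is exclusive, as with Python slices.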
CodePudding user response:
It looks like the comments answered the OP's question, but for completeness I thought I'd post the answer they discuss. Python's re module does not store all instances of a repeated capture group, only the last one; see issue 7132. The regex package, however, adds methods on the match object to handle repeated capture groups, including:
- captures - Returns a list of the strings matched in a group or groups.
- starts - Returns a list of the start positions.
- ends - Returns a list of the end positions.
- spans - Returns a list of the spans. Compare with matchobject.span([group]).
Hence, using the regex package and calling starts and ends on the match object should give the start and end indices of every repetition of group 1.
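A minimal sketch of that approach follows, assuming the third-party regex package is installed (pip install regex). The pattern is a reworked version of the one in the question, written so that each repetition of group 1 captures one category without the surrounding spaces. The names CATEGORY_PATTERN and repeated_group_spans are made up for this illustration.

```python
# Group 1 is repeated once per category; each repetition is followed by a comma
# or the end of the line.
CATEGORY_PATTERN = r"Categories : (?: *([A-Za-z]+(?: [A-Za-z]+)*) *(?:,|$))+"


def repeated_group_spans(doc_text):
    """Return (name, start, end) for every repetition of group 1."""
    import regex  # third-party package, not the standard-library re module

    match = regex.search(CATEGORY_PATTERN, doc_text, flags=regex.MULTILINE)
    if match is None:
        return []
    # captures/starts/ends keep every repetition of the group, whereas
    # group(1) and groups() only expose the last one.
    return list(zip(match.captures(1), match.starts(1), match.ends(1)))
```

On the example line this should yield Turbo Prop at 13 to 23, Very Light at 26 to 36, Light at 39 to 44 and Mid Size at 47 to 55. match.spans(1) returns the same positions as (start, end) tuples.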
