How to convert a string like "GHYTRDXK(something)YT(something else)YTRP(still something else)" into a list of cumulative strings disregarding the parentheses and their contents?
["GHYTRDXK", "GHYTRDXKYT", "GHYTRDXKYTYTRP"]
Note the number of parentheses can be more than 3.
CodePudding user response:
You can use re.split(pattern, string) and then process the resulting pieces into the desired result. filter(None, ...) removes the empty strings from the split pieces.
import re
txt = "GHYTRDXK(something)YT(something else)YTRP(still something else)"
words = filter(None, re.split(r'\([^\)]*\)', txt))
res = []
cur = ""
for word in words:
    cur += word  # extend the running string with the next piece
    res.append(cur)
print(res)
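To see why the filter step matters, it may help to look at what re.split returns before filtering. A small sketch on the same input (the names pieces and kept are only for illustration):

```python
import re

txt = "GHYTRDXK(something)YT(something else)YTRP(still something else)"

# Splitting on each parenthesized group leaves an empty string at the end,
# because the input itself ends with a parenthesized group.
pieces = re.split(r'\([^\)]*\)', txt)
print(pieces)  # ['GHYTRDXK', 'YT', 'YTRP', '']

# filter(None, ...) keeps only the truthy (non-empty) strings.
kept = list(filter(None, pieces))
print(kept)  # ['GHYTRDXK', 'YT', 'YTRP']
```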
CodePudding user response:
Use re.findall (which extracts the pieces and skips empty parts in one step) together with itertools.accumulate.
>>> import re
>>> from itertools import accumulate
>>>
>>> txt = "GHYTRDXK(something)YT(something else)YTRP(still something else)"
>>>
>>> list(accumulate(re.findall(r'(?<![^)]).+?(?![^(])', txt)))
['GHYTRDXK', 'GHYTRDXKYT', 'GHYTRDXKYTYTRP']
The pattern starts and ends with lookarounds. Each lookaround uses a double negation. For example, (?![^(]) means "not followed by a character that isn't a (".
With this double negation the lookaround covers the two cases:
- followed by a (
- followed by the end of the string.
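The end-of-string case can be checked with a made-up input that ends in plain text rather than a closing parenthesis. This is only a sketch; the input "AB(x)CD" and the names pattern and parts are assumptions for illustration:

```python
import re
from itertools import accumulate

pattern = r'(?<![^)]).+?(?![^(])'

# Hypothetical input whose last piece runs to the end of the string.
txt = "AB(x)CD"

parts = re.findall(pattern, txt)
print(parts)  # ['AB', 'CD']

# accumulate concatenates the pieces step by step.
print(list(accumulate(parts)))  # ['AB', 'ABCD']
```

Note that the lookbehind is also satisfied at the very start of the string, so an input that begins with a parenthesized group, such as "(x)AB", is matched as a single piece including the parentheses.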
CodePudding user response:
You can use the re.split(pattern, string) method to split a string using a Regex pattern as the delimiter.
For the example input provided, something like this should work:
import re
def split_string(input):
    return re.split(r'\(.+?\)', input)[:-1]
Note that this solution assumes the input ends with a parenthesized group, as in the example input, and accounts for this by removing the last element of the list ([:-1]). If that is not guaranteed in real input, filter the empty strings out of the list instead.
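If the input might not end with a parenthesized group, one possible variant drops the [:-1] and filters out empty strings instead. itertools.accumulate then builds the cumulative strings the question asks for. The function name cumulative_pieces is made up for this sketch:

```python
import re
from itertools import accumulate

def cumulative_pieces(text):
    # Split on each parenthesized group and drop any empty pieces.
    pieces = [p for p in re.split(r'\(.+?\)', text) if p]
    # Concatenate the pieces step by step.
    return list(accumulate(pieces))

print(cumulative_pieces("GHYTRDXK(something)YT(something else)YTRP(still something else)"))
# ['GHYTRDXK', 'GHYTRDXKYT', 'GHYTRDXKYTYTRP']
print(cumulative_pieces("AB(x)CD"))
# ['AB', 'ABCD']
```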
Regex explanation:
\(.+?\)
\( - escaped character, matches literal '('
. - matches any character except line breaks
+ - match 1 or more of the preceding token
? - makes the preceding quantifier lazy, causing it to match as few characters as possible
\) - escaped character, matches literal ')'
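The effect of the lazy quantifier shows up when the same split is run with a greedy version of the pattern. A short comparison on the example input (the names greedy and lazy are only for illustration):

```python
import re

txt = "GHYTRDXK(something)YT(something else)YTRP(still something else)"

# Greedy: .+ runs from the first '(' all the way to the last ')'.
greedy = re.split(r'\(.+\)', txt)
print(greedy)  # ['GHYTRDXK', '']

# Lazy: .+? stops at the nearest ')', so each group is matched separately.
lazy = re.split(r'\(.+?\)', txt)
print(lazy)  # ['GHYTRDXK', 'YT', 'YTRP', '']
```

Because + requires at least one character, this pattern does not match empty parentheses (); a pattern such as \([^)]*\) would.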
