I'm preparing for an exam, and here's the task:
Given a string which consists of uppercase letters of English alphabet, find the longest substring of it, which doesn't contain QW or WQ.
I know I could do re.split or something like that, but I made it a challenge for me to do it with a "regex matching" expression like len(max(re.findall(...), key=len)) without using split or other methods. Is it even possible?
To find all matching subtrings, I tried this:
list(map(lambda x: x[0], re.findall(r'(((?<!QW|WQ).) (?!QW|WQ))', text))
But this does match a substring which ends with WQ, for example. How do I fix this?
Let me also clarify something. Let's say the string is WQABCDEFGHQW. The answer in this case is QABCDEFGHQ, as this is exactly the longest substring that doesn't contain WQ or QW.
There can arise a related question. What if the task would be the same, except those Q-s in the string must not be matched, because they are still parts of WQ or QW. (so the answer would be just ABCDEFGH). In this case @Wiktor Stribiżew's answer is completely correct. And this is a difficult regular expression, and there also are some really helpful links in the comments, so I consider this answer helpful and very interesting as well. Please do not downvote it.
CodePudding user response:
You could use
(?:(?!Q(?!W)|(?<!W)Q).)
Python Example
import re
s = "HAKSDUWQHUPHPSAHFPUAHPUSNFJHJWQHPJQWPHJWQWASDIAS"
print(max(re.findall(r"(?:(?!Q(?!W)|(?<!W)Q).) ", s), key=len)) # 'HUPHPSAHFPUAHPUSNFJHJW'
See the regex demo
CodePudding user response:
.?(?:.(?<!QW|WQ))*
Any single character is ok. Any further character is ok unless it creates QW or WQ.
Python demo along with two split-solutions:
input: 'AQWBQWC':
find: ['AQ', 'WBQ', 'WC', '']
split1: ['AQ', 'WBQ', 'WC']
split2: ['AQ', 'WBQ', 'WC']
input: 'AQWBWQC':
find: ['AQ', 'WBW', 'QC', '']
split1: ['AQ', 'WBW', 'QC']
split2: ['AQ', 'WBW', 'QC']
input: '':
find: ['']
split1: ['']
split2: ['']
input: 'A':
find: ['A', '']
split1: ['A']
split2: ['A']
input: 'Q':
find: ['Q', '']
split1: ['Q']
split2: ['Q']
input: 'QW':
find: ['Q', 'W', '']
split1: ['Q', 'W']
split2: ['Q', 'W']
(The extra empty matches don't matter, just like other non-longest matches don't matter. Will get discarded by your max(..., key=len).)
Code (Try it online!):
import re
find = r'.?(?:.(?<!QW|WQ))*'
split1 = r'(?<=Q)(?=W)|(?<=W)(?=Q)'
split2 = r'(?=.(?<=QW|WQ))'
for s in 'AQWBQWC', 'AQWBWQC', '', 'A', 'Q', 'QW':
print('input: ', repr(s) ':')
print('find: ', re.findall(find, s))
print('split1:', re.split(split1, s))
print('split2:', re.split(split2, s))
print()
CodePudding user response:
You can use
(?:[A-Z](?<!QW|WQ)(?<!Q(?=W)|W(?=Q)))
See the regex demo.
Details:
(?:- start of a non-capturing group:[A-Z]- an uppercase ASCII letter(?<!QW|WQ)- a negative lookbehind that fails the match if, immediately on the left, there is aQWorWQsubstring (i.e. if the[A-Z]matchedW(preceded withQ) orQ(preceded withW))(?<!Q(?=W)|W(?=Q))- a negative lookbehind that fails the match if, immediately on the left, there is aQimmediately followed withW, orWthat is immediately followed withQ(i.e. if[A-Z]matchedQand the next char isW, or if[A-Z]matchesWand the next char isQ)
)- end of the group, one or more occurrences.
Another approach:
(?:(?!QW|WQ)[A-Z](?<!QW|WQ))
See this regex demo. Details:
(?:- start of a non-capturing group:(?!QW|WQ)- a negative lookahead that fails the match if, immediately on the right, there is aQWorWQsubstring[A-Z]- an uppercase ASCII letter(?<!QW|WQ)- a negative lookbehind that fails the match if, immediately on the left, there is aQWorWQsubstring
)- end of the group, one or more occurrences.
CodePudding user response:
You could use that: (?:(?!WQ|QW).(?<!WQ|QW))
At each position (of the dot) it tests forward and backward if the character isn't a part of WQ or QW.
The key is to test forward before the dot and backward after the dot position.
Other possible pattern: (?:.(?!.?(?<=WQ|QW)))
This time a lookbehind is inside a negative lookahead but since it is preceded by an optional character .?, it checks the two possible positions of the WQ/QW sequence.
Or to reduce the number of steps an unrolled pattern:
[^WQ] (?:[QW](?!.?(?<=QW|WQ))[^WQ]*)*|(?:[QW](?!.?(?<=QW|WQ))[^WQ]*)
But it's a bit long (it's unrolled).
