What I'm looking to do is to modify a regex (JS flavor) to not match if the pattern is both preceded and followed by the same string.
By way of a simple analogy, say I want to match all instances of n that are not both preceded and followed by e. So, for example, the regex should not match the n in alkene, but it should still match the n in pen or nest, which only have the e directly adjacent to n on one side, not both.
Most older threads I've seen trying to find an answer basically say "just use negative lookarounds", but the problem is that (?<!e)n(?!e) doesn't match any of those inputs - because the lookbehind and lookahead are processed by the regex engine separately, so it considers either condition to be sufficient to exclude the match.
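To make the failure concrete, here is a quick check that can be run in Node or a browser console (a minimal sketch; it assumes an engine with lookbehind support, and the variable names are only illustrative):

```javascript
// Each lookaround is checked on its own, so an e on either side rejects the n.
const separate = /(?<!e)n(?!e)/;
const words = ["alkene", "pen", "nest"];

// test() returns true when the regex finds a match anywhere in the word.
const results = words.map((word) => separate.test(word));
console.log(results); // [ false, false, false ]
```

All three come back false, although only alkene should be excluded.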
(The real regex is (?<!¸ª)()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)(?!¸ª) and it's failing to match the ɣʷ in t͡ʃe:h₁dɣʷo¸ªh₂¸ª, but that makes the problem look a lot harder to explain than it needs to be)
How do you modify a regex to only exclude patterns when they're nested?
CodePudding user response:
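The (?<!b)a(?!b) pattern here must be replaced with (?<!b(?=ab))a or a(?!(?<=ba)b). The point is to nest the opposite lookaround, made positive, inside the negative one: a lookahead inside the lookbehind, or a lookbehind inside the lookahead. That way the match is rejected only when both sides are present at once.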
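Applied to the n/e analogy from the question, the two forms could be checked roughly like this (a sketch only; the variable names are made up for illustration, and lookbehind support in the engine is assumed):

```javascript
// Negative lookbehind containing a positive lookahead:
// reject n only when it is preceded by an e that is itself followed by "ne".
const viaLookbehind = /(?<!e(?=ne))n/;

// Negative lookahead containing a positive lookbehind:
// reject n only when the next e is preceded by "en".
const viaLookahead = /n(?!(?<=en)e)/;

for (const word of ["alkene", "pen", "nest"]) {
  console.log(word, viaLookbehind.test(word), viaLookahead.test(word));
}
// alkene false false
// pen true true
// nest true true
```

Both variants reject only the n that has an e on each side.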
See your pattern fix (without any optimizations) where I took the lookahead, pasted it inside lookbehind after ª, reversed the lookahead (i.e. made it positive) and added the whole pattern before ¸ª in the lookahead to be able to get to the right-hand ¸ª:
(?<!¸ª(?=(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)¸ª))()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)
Or, if you put the lookbehind into lookahead:
()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)(?!(?<=¸ª(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a))¸ª)
See the regex demo (and regex demo #2).
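If the demos are not at hand, both variants can be tried locally with something like the following. This is a sketch: the inner lookaround is written as positive, as described above, and the `core` helper string and the two extra test inputs are my own additions, not from the question.

```javascript
// The consuming part, kept in one string so it is not typed three times.
const core = "(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)";

// Variant 1: positive lookahead nested in the negative lookbehind.
const viaLookbehind = new RegExp(`(?<!¸ª(?=${core}¸ª))()${core}`);
// Variant 2: positive lookbehind nested in the negative lookahead.
const viaLookahead = new RegExp(`()${core}(?!(?<=¸ª${core})¸ª)`);

const sample = "t͡ʃe:h₁dɣʷo¸ªh₂¸ª"; // string from the question
const nested = "¸ªɣʷo¸ª";            // hypothetical input: ¸ª on both sides
const oneSided = "¸ªɣʷod";           // hypothetical input: ¸ª on the left only

for (const re of [viaLookbehind, viaLookahead]) {
  console.log(
    re.exec(sample)?.[0],   // the ɣʷo that the original regex missed
    re.exec(nested),        // null, because ¸ª is on both sides
    re.exec(oneSided)?.[0]  // still matched
  );
}
```

Note that the second variant repeats the capturing group inside the lookbehind, so its group numbering differs from the first.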
Whenever your pattern is simple, it is best not to repeat the pattern in the lookarounds; you can usually just use . or .{x}, where x stands for the number of chars your consuming pattern part can match. Here the consuming part matches two or three chars (a one- or two-char consonant followed by one vowel), so you may probably use (?<!¸ª(?=.{2,3}¸ª))()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a), but I do not have any edge cases to test against.
Enhancing this further may yield (?<!¸ª(?=.{2,3}¸ª))()(ɣʷ|[hr]₂|r₃|w|j)([eoøɑiɚyua]) (demo).
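A rough check of the shortened form might look like this. It is a sketch under two assumptions: the inner lookahead is positive, and the consuming part always spans two or three UTF-16 code units, which is why the quantifier is written {2,3} here. The `nested` and `shortNested` inputs are hypothetical.

```javascript
// Shortened pattern: the lookahead only counts characters instead of
// repeating the consonant and vowel alternations.
const short = /(?<!¸ª(?=.{2,3}¸ª))()(ɣʷ|[hr]₂|r₃|w|j)([eoøɑiɚyua])/;

const sample = "t͡ʃe:h₁dɣʷo¸ªh₂¸ª"; // string from the question
const nested = "¸ªɣʷo¸ª";            // three-char syllable between two ¸ª
const shortNested = "¸ªwa¸ª";        // two-char syllable between two ¸ª

console.log(short.exec(sample)?.[0]); // ɣʷo
console.log(short.exec(nested));      // null
console.log(short.exec(shortNested)); // null
```

Because . accepts any character, this form can also reject a two-char syllable that merely has ¸ª three characters after the left-hand ¸ª, so it is looser than repeating the full pattern.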
