This might seem to be a repetitive question here but I have tried all other SO posts and the suggestions are not working for me.
Basically, I want to exclude strings that have a particular substring in them, either at the beginning, middle or at the end.
Here is an example,
Max_Num_HR, HR_Max_Num, Max_HR_Num
I want to exclude the strings that contain either _HR (at the end), HR_ (at the beginning) or _HR_ (in between).
What I have tried so far:
r"(^((?!HR_).*))(?<!_HR)$"
This will successfully exclude strings that have HR_ (at the beginning) and _HR (at the end), but not _HR_ (in between)
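To make that behaviour concrete, here is a minimal sketch that runs the attempted pattern against the three test strings. It assumes Python's `re` module (suggested by the `r"..."` prefix); the names `PATTERN`, `test_strings` and `results` are only illustrative.

```python
import re

# The pattern from the question, compiled once.
PATTERN = re.compile(r"(^((?!HR_).*))(?<!_HR)$")

test_strings = ["Max_Num_HR", "HR_Max_Num", "Max_HR_Num"]

# True means the string matched, i.e. it was NOT excluded.
results = {s: PATTERN.search(s) is not None for s in test_strings}

for s, matched in results.items():
    print(f"{s}: {'matched' if matched else 'excluded'}")
```

Only Max_HR_Num still matches, because nothing in the pattern checks for _HR_ in the middle of the string.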
I have looked at How to exclude a string in the middle of a RegEx string?
But their solution did not seem to work for me.
I understand that the first segment of my code, (^((?!HR_).*)), will exclude everything that starts with HR_, since I have a ^ at the beginning followed by a negative lookahead. The second segment, (?<!_HR)$, sits at the end of the string and uses a negative lookbehind to check that _HR does not come right before it. Going with this train of thought, I tried including (?!_HR_) in between the two segments, but to no avail.
So, how do I get it to exclude all three HR_, _HR_, _HR considering Max_Num_HR, HR_Max_Num, Max_HR_Num as the test case?
CodePudding user response:
The pattern is missing the assertion for _HR_ somewhere in the string.
You can place the negative lookbehind, which asserts that _HR is not at the end, after the dollar sign, as in $(?<!_HR), to prevent some backtracking over the .+
Note that for a match only you don't need the capture groups.
^(?!HR_)(?!.*_HR_).+$(?<!_HR)
^ Start of string
(?!HR_) Assert not HR_ at the start
(?!.*_HR_) Assert not _HR_ anywhere in the string
.+$ Match 1+ chars, so that an empty string is not matched, and assert the end of the string
(?<!_HR) Assert not _HR directly to the left
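As a rough check, the pattern can be tried in Python. This sketch assumes the pattern reads `^(?!HR_)(?!.*_HR_).+$(?<!_HR)`, with `.+` before the `$`; the helper name `is_allowed` and the extra sample `Max_Num` are made up for illustration.

```python
import re

# Assumed reading of the answer's pattern, with ".+" before the "$".
PATTERN = re.compile(r"^(?!HR_)(?!.*_HR_).+$(?<!_HR)")


def is_allowed(name: str) -> bool:
    """Return True when the name passes all three HR checks."""
    return PATTERN.search(name) is not None


# The three test cases from the question, one harmless name,
# and the empty string (rejected because ".+" needs at least one char).
for name in ["Max_Num_HR", "HR_Max_Num", "Max_HR_Num", "Max_Num", ""]:
    print(repr(name), is_allowed(name))
```

All three test cases from the question are rejected, while a name without any of the HR forms passes.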
CodePudding user response:
One way to avoid matching strings that contain 'HR_' at the beginning, '_HR_' in the middle or '_HR' at the end is to match a regular expression having a beginning-of-string anchor followed by a negative lookahead, followed by .*:
^(?!HR_|.+_HR_.+|.+_HR$).*
Note that lines containing '_HR_' at the beginning or end are matched, as per the specification.
The negative lookahead reads, "do not match 'HR_' at the beginning of the string, or '_HR_' when preceded by at least one character and followed by at least one character, or '_HR' at the end of the string".
The entire string is matched if and only if the negative lookahead succeeds.
The negative lookahead could of course be replaced by three negative lookaheads:
^(?!HR_)(?!.+_HR_.+)(?!.+_HR$).*
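The two forms can be compared side by side in Python. This is only a sketch: it assumes the quantifiers in both patterns are `.+`, and the names `SINGLE`, `SPLIT`, `matches` and the extra sample strings are hypothetical.

```python
import re

# Single lookahead with three alternatives (assumed ".+" quantifiers).
SINGLE = re.compile(r"^(?!HR_|.+_HR_.+|.+_HR$).*")
# The same conditions written as three separate lookaheads.
SPLIT = re.compile(r"^(?!HR_)(?!.+_HR_.+)(?!.+_HR$).*")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True when the pattern matches from the start of text."""
    return pattern.match(text) is not None


samples = [
    "Max_Num_HR",  # '_HR' at the end
    "HR_Max_Num",  # 'HR_' at the beginning
    "Max_HR_Num",  # '_HR_' in the middle
    "Max_Num",     # none of the three forms
    "_HR_Max",     # '_HR_' at the very beginning
    "Max_HR_",     # '_HR_' at the very end
]

outcome = {s: (matches(SINGLE, s), matches(SPLIT, s)) for s in samples}

for s, (single, split) in outcome.items():
    print(f"{s}: single={single} split={split}")
```

On these samples both forms agree: the three test cases are rejected, and strings with '_HR_' at the very beginning or very end are still matched.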
