I am trying to write a pattern that finds all substrings of the following format in a text:
system:microsoft,
flow:to_server,
vho:file-was-closed,
heur250:unknown.file
Also, I want to exclude substrings where the part before or after the : consists only of digits:
03:00
file:123
I don't want to match substrings where the part before the : is mailto:
mailto:user
And I don't want to match substrings where the part before or after the : ends with an image extension such as jpg or png:
cid:image003.png
I've written a pattern, but it doesn't work properly:
pattern = r'(?!^\d+$)(?!mailto)[\w\d\.-]+:[\w\d\.-(?!(jpg|png))]+'
Could you help me fix it and explain what I am doing wrong?
CodePudding user response:
Can you try:
(?<!\S)(?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$))[\w.-]+:[\w.-]+(?!\S)
See an online demo. Admittedly, the last part of the pattern could be made more specific so that strings like ...:... are not accepted, but that's up to you.
(?<!\S) - Assert that the position is not preceded by a non-whitespace char;
(?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$)) - A negative lookahead with alternation: reject 'mailto', and reject a trailing '.jpg' or '.png' or a digits-only part on either side of the colon;
[\w.-]+:[\w.-]+ - Match one or more characters from the given class on either side of the colon;
(?!\S) - Assert that the position is not followed by a non-whitespace char.
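To check the pattern above against the cases from the question, here is a minimal sketch. The sample string is an assumption: it lists the question's tokens separated by spaces only, since the whitespace boundaries would reject a token with a trailing comma.

```python
import re

# Assumed sample: the tokens from the question, separated by spaces only
text = ("system:microsoft flow:to_server vho:file-was-closed "
        "heur250:unknown.file 03:00 file:123 mailto:user "
        "cid:image003.png file.png:word")

pattern = re.compile(
    r'(?<!\S)'                                              # left whitespace boundary
    r'(?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$))'  # reject unwanted tokens
    r'[\w.-]+:[\w.-]+'                                      # part:part
    r'(?!\S)'                                               # right whitespace boundary
)

# finditer + group() yields whole matches; findall would instead return the
# capture group that sits inside the lookahead, which is empty here
matches = [m.group() for m in pattern.finditer(text)]
print(matches)
```

With this input, only the four wanted tokens should remain in matches.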
CodePudding user response:
If your matches are always delimited by whitespace, you can use
(?<!\S)(?!mailto:|\d+:)[\w.-]+(?<!\.jpg|\.png):(?!\d+(?!\S))[\w.-]+(?!\S)(?<!\.jpg|\.png)
See the regex demo.
Details:
(?<!\S) - left-hand whitespace boundary
(?!mailto:|\d+:) - immediately to the right, there can be no mailto: and no run of one or more digits followed by a : char
[\w.-]+ - one or more word, . or - chars
(?<!\.jpg|\.png) - no .jpg or .png is allowed immediately to the left
: - a colon
(?!\d+(?!\S)) - the part after the colon cannot consist only of digits up to the next whitespace or the end of the string
[\w.-]+ - one or more word, . or - chars
(?!\S) - right-hand whitespace boundary
(?<!\.jpg|\.png) - no .jpg or .png is allowed immediately to the left.
If your matches can appear in any context, you can use a solution like
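A short sketch of the whitespace-boundary pattern in use. The sample strings are assumptions built from the tokens in the question. The second call illustrates the boundary requirement: a token followed directly by a comma is not matched.

```python
import re

# Assumed sample: the tokens from the question, separated by spaces only
text = ("system:microsoft flow:to_server vho:file-was-closed "
        "heur250:unknown.file 03:00 file:123 mailto:user "
        "cid:image003.png file.png:word")

pattern = re.compile(
    r'(?<!\S)(?!mailto:|\d+:)'          # boundary; no mailto: or digits-only prefix
    r'[\w.-]+(?<!\.jpg|\.png)'          # left part, not ending in .jpg/.png
    r':(?!\d+(?!\S))'                   # colon; right part is not digits-only
    r'[\w.-]+(?!\S)(?<!\.jpg|\.png)'    # right part, boundary, no .jpg/.png ending
)

# No capture groups, so findall returns the whole matches
matches = pattern.findall(text)
print(matches)

# A trailing comma is a non-whitespace char, so the first token is skipped
with_comma = pattern.findall("system:microsoft, flow:to_server")
print(with_comma)
```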
import re

text = "system:microsoft flow:to_server vho:file-was-closed heur250:unknown.file, file.png:word, 03:00, file:123, mailto:user, cid:image003.png"
# The first five alternatives consume the unwanted matches; only the last one is captured
pattern = r'\bmailto:[\w.-]+|\b\d+:[\w.-]+|[\w.-]+:\d+|[\w.-]+:[\w.-]*\.(?:jpg|png)(?![\w.-])|[\w.-]*\.(?:jpg|png):[\w.-]+|([\w.-]+:[\w.-]+)'
# re.findall returns Group 1, which is empty whenever an unwanted alternative matched
print([x for x in re.findall(pattern, text) if x != ''])
See this Python demo.
Output:
['system:microsoft', 'flow:to_server', 'vho:file-was-closed', 'heur250:unknown.file']
Note that this solution is based on the "best regex trick ever". Details:
\bmailto:[\w.-]+| - whole word mailto, a :, and then one or more word, . or - chars, or
\b\d+:[\w.-]+| - word boundary, one or more digits, :, and then one or more word, . or - chars, or
[\w.-]+:\d+| - one or more word, . or - chars, :, one or more digits, or
[\w.-]+:[\w.-]*\.(?:jpg|png)(?![\w.-])| - one or more word, . or - chars, :, zero or more word, . or - chars, then . and jpg or png not followed by a word, . or - char, or
[\w.-]*\.(?:jpg|png):[\w.-]+| - zero or more word, . or - chars, ., jpg or png, :, and then one or more word, . or - chars, or
([\w.-]+:[\w.-]+) - Group 1 (we'll output this value only): one or more word, . or - chars, :, and one or more word, . or - chars.
All the parts before the last Group 1 pattern are there to filter out unwelcome matches.
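As for what goes wrong in the pattern from the question, two points seem to account for most of it: a lookahead written inside square brackets is read as a list of literal characters, and (?!^\d+$) only rejects a string that is digits from start to end. A minimal sketch of both, using made-up test strings:

```python
import re

# 1. Inside [...] a lookahead is not a lookahead: ( ? ! | ) and the letters
#    j, p, g, n are just literal members of the character class.
literal_class = re.compile(r'[(?!(jpg|png))]')
class_hits = literal_class.findall('a(j!b|')
print(class_hits)

# 2. (?!^\d+$) only fails when the WHOLE string consists of digits, so it
#    does not stop a digits-only part next to the colon from matching.
digits_hit = re.search(r'(?!^\d+$)[\w.-]+:[\w.-]+', '03:00')
print(digits_hit.group())
```

This is why the answers above move each restriction into its own lookaround or into a separate alternative, placed outside any character class.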
