Hi, I need to match the following pattern: numbers only.
https://www.example.com/4145
https://www.example.com/45733
https://example.com/list/05.htm
https://example.com/list/06.htm
https://example.com/document/09.htm
https://example.com/list/09
https://example.com/page/07
Here is my pattern:
(?<=[=|\/|-])\d{1,6}(?![\._-])
Problem: how do I exclude URLs like these?
https://www.example.com/107.150.126.47.html
https://www.example.com/107_150_126_47.html
https://www.example.com/107-150-126-47.html
I need to match a run of 1 to 6 digits that is preceded by / or = (matching only the digits), but not when the digits are part of something like example.com/107.150.126.47.html (digits joined by dots, dashes, or underscores).
CodePudding user response:
According to your examples, this will work:
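(?<=[=/-])\d{1,6}(?!([._-]?.*html))
Some explanations:
It matches 1 to 6 consecutive digits that are preceded by =, / or -, and that are NOT followed by an optional ., - or _ and then html later on the line. Inside the character class the hyphen is placed last so that it is a literal hyphen rather than a range. Because the separator is optional, the lookahead in effect rejects any digits that have html somewhere after them on the same line.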
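As a quick check, here is a minimal Python sketch that runs this kind of pattern against some of the URLs from the question. The character class is written as `[._-]` here so that the hyphen is literal, and the helper name `find_numbers` is only illustrative. `finditer` is used instead of `findall` because the pattern contains a capturing group inside the lookahead.

```python
import re

# Lookbehind: digits must be preceded by "=", "/" or "-".
# Lookahead: reject the digits if "html" appears later on the line.
PATTERN = re.compile(r'(?<=[=/-])\d{1,6}(?!([._-]?.*html))')

urls = [
    "https://www.example.com/4145",
    "https://example.com/list/05.htm",
    "https://example.com/page/07",
    "https://www.example.com/107.150.126.47.html",
    "https://www.example.com/107_150_126_47.html",
    "https://www.example.com/107-150-126-47.html",
]

def find_numbers(url):
    """Return every digit run in url that the pattern accepts."""
    return [m.group(0) for m in PATTERN.finditer(url)]

results = {url: find_numbers(url) for url in urls}
for url, found in results.items():
    print(url, found)
```

The first three URLs yield their trailing number and the three IP-like URLs yield nothing. Keep in mind that a URL such as h://w.com/03.html would also be rejected, since it contains html after the digits.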
CodePudding user response:
You need to nail down what you want allowed/excluded after your run of digits (i.e. the lookahead specification).
A very loose spec would allow anything other than
(underscore) or (hyphen) or (dot followed by digit)
A very tight spec would allow only
(end of line) or (dot followed by 'htm')
There are many other possibilities.
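To make the two extremes concrete, here is a rough Python sketch of how each spec might be written as a lookahead. It assumes the digits must be preceded by / or = only, as the question states. The loose version also forbids a following digit, so the regex engine cannot backtrack to a shorter run such as 10 inside 107. The names `LOOSE`, `TIGHT` and `matches`, and the last URL in the list, are made up for illustration.

```python
import re

# Loose spec: anything may follow except a digit, "_", "-", or "." plus a digit.
LOOSE = re.compile(r'(?<=[=/])\d{1,6}(?!\d|[_-]|\.\d)')
# Tight spec: only end of string or a literal ".htm" may follow.
TIGHT = re.compile(r'(?<=[=/])\d{1,6}(?=$|\.htm)')

urls = [
    "https://www.example.com/4145",
    "https://example.com/list/05.htm",
    "https://www.example.com/107.150.126.47.html",
    "https://www.example.com/107_150_126_47.html",
    "https://www.example.com/107-150-126-47.html",
    "https://example.com/item/42?ref=abc",  # hypothetical URL, not from the question
]

def matches(pattern, url):
    """Return all digit runs in url accepted by pattern."""
    return [m.group(0) for m in pattern.finditer(url)]

loose = {u: matches(LOOSE, u) for u in urls}
tight = {u: matches(TIGHT, u) for u in urls}

for u in urls:
    print(u, "loose:", loose[u], "tight:", tight[u])
```

Both versions reject the three IP-like URLs. They differ only on the last URL: the loose spec accepts 42 because a ? follows it, while the tight spec rejects it.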
CodePudding user response:
Use the following regex.
(\/|=)(\d+)(\.html?)
(\/|=) ensures the number comes after a / or an = sign.
(\d+) is the actual number; the + quantifier lets it match numbers with more than one digit.
(\.html?) is the last group; it just ensures the matched number has .html or .htm as a suffix.
In Python, it would look like:
import re
s = """https://www.example.com/107.150.126.47.html
https://www.example.com/107_150_126_47.html
https://www.example.com/107-150-126-47.html
https://www.example.com..=4145...
https://www.example.com../45733...
h://w.com/03.html
h://w.com/13.htm
h://w.com/?w=03341.html"""
matches = re.findall(r'(\/|=)(\d+)(\.html?)', s)
print(matches)
# [('/', '03', '.html'), ('/', '13', '.htm'), ('=', '03341', '.html')]
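If only the digits are wanted rather than a tuple per match, one option is to make the first and last groups non-capturing, so that `re.findall` returns just the middle group. This is a small sketch of that variation, not part of the original answer; the name `digits_only` is illustrative.

```python
import re

s = """h://w.com/03.html
h://w.com/13.htm
h://w.com/?w=03341.html
https://www.example.com/107.150.126.47.html"""

# (?:...) groups still have to match but are not returned by findall,
# so the result is a flat list of the digit strings.
digits_only = re.findall(r'(?:/|=)(\d+)(?:\.html?)', s)
print(digits_only)
# ['03', '13', '03341']
```

This pattern still requires a .htm or .html suffix, so a URL such as https://www.example.com/4145 from the question produces no match.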
