Regex to match numbers preceded by = or /, but not digits joined by ., _ or -

Hi, I need to match this pattern: numbers only.

https://www.example.com/4145
https://www.example.com/45733
https://example.com/list/05.htm
https://example.com/list/06.htm
https://example.com/document/09.htm
https://example.com/list/09
https://example.com/page/07

Here is my pattern:

(?<=[=|\/|-])\d{1,6}(?![\._-])

Problem: how do I exclude URLs like these?

https://www.example.com/107.150.126.47.html
https://www.example.com/107_150_126_47.html
https://www.example.com/107-150-126-47.html


I need to match runs of 1 to 6 digits that come right after / or = (matching only the digits), but not something like example.com/107.150.126.47.html, where the digits are joined by dots, dashes, or underscores.

CodePudding user response：

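According to your examples, this will work:

(?<=[=/-])\d{1,6}(?![\d._-].*html)

Some explanations:

It matches 1 to 6 consecutive digits that must be preceded by =, / or -, and these digits must NOT be followed by a digit, ., - or _ with html later on the same line. The hyphen goes last in the character class so that it is read as a literal rather than as a range, and \d keeps the engine from backtracking to a shorter run of digits.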
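
A quick way to check a pattern like this against the sample URLs is a small script. This is only a sketch, assuming Python's re module; the lookahead class is written as [\d._-] (hyphen last, digits included), and the names PATTERN, urls and results are made up for illustration.

```python
import re

# Lookahead class [\d._-]: the hyphen is last so it is a literal, and \d
# prevents a match on a shorter prefix of a longer run of digits.
PATTERN = re.compile(r"(?<=[=/-])\d{1,6}(?![\d._-].*html)")

urls = [
    "https://www.example.com/4145",
    "https://example.com/list/05.htm",
    "https://example.com/page/07",
    "https://www.example.com/107.150.126.47.html",
    "https://www.example.com/107_150_126_47.html",
    "https://www.example.com/107-150-126-47.html",
]

# Map each URL to the list of digit runs the pattern finds in it.
results = {url: PATTERN.findall(url) for url in urls}
for url, found in results.items():
    print(url, "->", found)
```

The first three URLs should each yield their trailing number, and the three .html URLs should yield nothing. Note that this pattern relies on the unwanted URLs ending in html, so a wanted URL such as /03.html would be rejected too.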

CodePudding user response：

You need to nail down what you want allowed or excluded after your run of digits (i.e. the lookahead specification).

A very loose spec would allow anything other than

(underscore) or (hyphen) or (dot followed by digit)

A very tight spec would allow only

(end of line) or (dot followed by 'htm')

There are many other possibilities.
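
The two specs above could be sketched in Python roughly as follows. This is one possible reading, not the only one: it assumes the digits must be preceded by / or = only, adds \d to the loose lookahead so a long run of digits cannot be matched in part, and uses a positive lookahead for the tight spec. The names LOOSE, TIGHT, loose_hits and tight_hits are hypothetical, as is the /list/09/edit URL, which is there to show where the two specs differ.

```python
import re

# Loose: reject only when followed by a digit, underscore, hyphen, or dot+digit.
LOOSE = re.compile(r"(?<=[=/])\d{1,6}(?![\d_-]|\.\d)")
# Tight: accept only when followed by end of string or ".htm".
TIGHT = re.compile(r"(?<=[=/])\d{1,6}(?=$|\.htm)")

urls = [
    "https://www.example.com/4145",
    "https://example.com/list/05.htm",
    "https://example.com/list/09/edit",  # hypothetical URL, not from the question
    "https://www.example.com/107.150.126.47.html",
    "https://www.example.com/107_150_126_47.html",
    "https://www.example.com/107-150-126-47.html",
]

loose_hits = {url: LOOSE.findall(url) for url in urls}
tight_hits = {url: TIGHT.findall(url) for url in urls}

for url in urls:
    print(url, "loose:", loose_hits[url], "tight:", tight_hits[url])
```

Both versions skip the three dotted, underscored and hyphenated URLs; they differ on /list/09/edit, where only the loose spec matches 09.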

CodePudding user response：

Use the following regex.

(\/|=)(\d+)(\.html?)

(\/|=) ensures the number comes after a / or an = sign
(\d+) is the actual number; the + quantifier lets it match numbers with more than one digit
(\.html?) is the last group, which ensures the matched number has .html or .htm as its suffix

In Python, it'd look like:

import re

s = """https://www.example.com/107.150.126.47.html
https://www.example.com/107_150_126_47.html
https://www.example.com/107-150-126-47.html
https://www.example.com..=4145...
https://www.example.com../45733...
h://w.com/03.html
h://w.com/13.htm
h://w.com/?w=03341.html"""

matches = re.findall(r'(\/|=)(\d+)(\.html?)', s)
print(matches)
# [('/', '03', '.html'), ('/', '13', '.htm'), ('=', '03341', '.html')]