How to compare URLs in Python (not the traditional way)?

In Python, I used == to check whether two URLs are the same, but for my purposes the following pairs should count as the same too:

https://hello.com?test=test and https://hello.com?test22=test22

https://hello.com and https://hello.com#you_can_ignore_this

Is there a built-in function for this, so that I don't have to compare the URLs character by character?

CodePudding user response：

You can use urllib.parse to parse the URLs and keep only the leading components you care about (here the scheme, netloc, and path):

from urllib.parse import urlparse

url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com/?test22=test22')

url1[:3]
# ('https', 'hello.com', '/')

url1[:3] == url2[:3]
# True

Comparing only the netloc (aka "domain"):

url1[1] == url2[1]

As you can see, once you have parsed the URL you have a lot of flexibility to perform comparisons.
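The comparison above can be wrapped in a small helper. This is only a minimal sketch: the function name same_location is made up for illustration, and it assumes that "same" means identical scheme, netloc, and path, with the query string and fragment ignored.

```python
from urllib.parse import urlparse


def same_location(a: str, b: str) -> bool:
    """Return True if both URLs share scheme, netloc and path.

    Hypothetical helper: the query string and the fragment are ignored.
    """
    # Slicing a ParseResult with [:3] yields (scheme, netloc, path).
    return urlparse(a)[:3] == urlparse(b)[:3]


# The two pairs from the question:
print(same_location("https://hello.com?test=test", "https://hello.com?test22=test22"))
print(same_location("https://hello.com", "https://hello.com#you_can_ignore_this"))
```

Both calls should print True, while URLs that differ in their path (for example /a versus /b) compare as different.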

CodePudding user response：

Using urlparse is the way to go, as suggested in another answer. However, URLs with an empty path and URLs whose path is just the root "/" need special treatment, because they refer to the same document.

from urllib.parse import urlparse

url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com')

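# An empty path and "/" are treated as equivalent; otherwise compare
# scheme, netloc and path as usual.
result = (
    (url1.path in ("", "/") and url2.path in ("", "/") and url1[:2] == url2[:2])
    or url1[:3] == url2[:3]
)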
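An alternative way to express the same idea is to normalize each URL into a comparison key first, so that the empty-path case does not need a separate branch. The sketch below is an assumption about how one might structure this, and comparison_key is a hypothetical name, not part of urllib.

```python
from urllib.parse import urlparse


def comparison_key(url: str) -> tuple:
    """Build a (scheme, netloc, path) key for comparing URLs.

    Hypothetical helper: an empty path is mapped to "/" so that
    'https://hello.com' and 'https://hello.com/' produce the same key.
    The query string and the fragment are dropped.
    """
    parts = urlparse(url)
    path = parts.path or "/"  # '' and '/' both become '/'
    return (parts.scheme, parts.netloc, path)


print(comparison_key("https://hello.com/?test=test") == comparison_key("https://hello.com"))
```

Because the key is a plain tuple, it can also be used as a dictionary key or stored in a set to group or deduplicate URLs.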

CodePudding user response：

It's not very clear what you mean, but you should start by parsing the URL.

You could check it using urlparse().

from urllib.parse import urlparse
url = urlparse("https://hello.com?test=test")

urlparse returns a ParseResult named tuple:

ParseResult(scheme='https', netloc='hello.com', path='', params='', query='test=test', fragment='')

You can then compare individual components by index:

url[1] == 'hello.com'  # index 1 is the netloc

https://docs.python.org/3/library/urllib.parse.html
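
If only the fragment should be ignored (the second pair in the question) while the query string still matters, urllib.parse also provides urldefrag, which splits off the part after "#". A minimal sketch, with same_ignoring_fragment being a hypothetical helper name:

```python
from urllib.parse import urldefrag


def same_ignoring_fragment(a: str, b: str) -> bool:
    """Return True if the URLs are equal once their fragments are removed.

    Hypothetical helper: unlike comparing only scheme, netloc and path,
    this still treats different query strings as different URLs.
    """
    # urldefrag returns a named tuple (url, fragment); .url has no fragment.
    return urldefrag(a).url == urldefrag(b).url


print(same_ignoring_fragment("https://hello.com", "https://hello.com#you_can_ignore_this"))
print(same_ignoring_fragment("https://hello.com?test=test", "https://hello.com?test22=test22"))
```

The first call should print True and the second False, since the two query strings differ.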