In Python, I used == to check whether two URLs are the same, but for my purposes the following pairs should count as the same too:
https://hello.com?test=test and https://hello.com?test22=test22
https://hello.com and https://hello.com#you_can_ignore_this
Is there a built-in function for this, so that I don't have to compare the URLs character by character?
CodePudding user response:
You can use urllib to parse the URLs and keep only the leading components you care about (here the scheme, netloc, and path):
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com/?test22=test22')
url1[:3]
# ('https', 'hello.com', '/')
url1[:3] == url2[:3]
# True
Comparing only the netloc (roughly the "domain", although it also includes any port or credentials):
url1[1] == url2[1]
As you can see, once you have parsed the URL you have a lot of flexibility to perform comparisons.
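The slicing approach above can be wrapped in a small helper. This is only a sketch: the name same_base_url is made up for illustration, and it assumes that "same" means equal scheme, netloc, and path, with the query string and fragment ignored.

```python
from urllib.parse import urlparse


def same_base_url(a: str, b: str) -> bool:
    """Compare two URLs by scheme, netloc, and path only.

    Hypothetical helper: params, query, and fragment are ignored.
    """
    # A ParseResult is a tuple, so [:3] is (scheme, netloc, path).
    return urlparse(a)[:3] == urlparse(b)[:3]


# The two pairs from the question, plus one pair that really differs.
print(same_base_url("https://hello.com?test=test", "https://hello.com?test22=test22"))  # True
print(same_base_url("https://hello.com", "https://hello.com#you_can_ignore_this"))      # True
print(same_base_url("https://hello.com", "https://hello.com/other"))                    # False
```

Note that this still treats https://hello.com and https://hello.com/ as different, because their paths are '' and '/'.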
CodePudding user response:
Using urlparse is the way to go, as suggested in another answer. However, URLs with an empty path and URLs whose path is just the root "/" need special treatment, because they refer to the same document.
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com')
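# An empty path and "/" name the same document, so in that case compare only scheme and netloc.
result = (url1.path in ("", "/") and url2.path in ("", "/") and url1[:2] == url2[:2]) \
    or (url1[:3] == url2[:3])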
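An alternative way to express the same idea is to normalize the path before comparing, so that a single tuple comparison covers both cases. The helper names url_key and same_document below are hypothetical, and the sketch assumes that only the empty path and "/" should be merged (other trailing slashes are left alone).

```python
from urllib.parse import urlparse


def url_key(url: str) -> tuple:
    """Return (scheme, netloc, path) with an empty path normalized to '/'."""
    parts = urlparse(url)
    path = parts.path or "/"  # '' and '/' both become '/'
    return (parts.scheme, parts.netloc, path)


def same_document(a: str, b: str) -> bool:
    """Hypothetical helper: compare two URLs, ignoring query and fragment."""
    return url_key(a) == url_key(b)


print(same_document("https://hello.com/?test=test", "https://hello.com"))  # True
print(same_document("https://hello.com/a", "https://hello.com/b"))         # False
```

A normalized key like this can also be used as a dictionary key or set member when deduplicating many URLs.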
CodePudding user response:
It's not entirely clear what you mean, but you should start by parsing the URL.
You can do that with urlparse():
from urllib.parse import urlparse
url = urlparse("https://hello.com?test=test")
The urlparse() function returns a ParseResult:
ParseResult(scheme='https', netloc='hello.com', path='', params='', query='test=test', fragment='')
You can then compare individual components, for example:
url[1] == 'hello.com'  # index 1 is the netloc; url.netloc is equivalent
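If only the host matters, the hostname attribute of the ParseResult may be more convenient than the raw netloc, since it is lowercased and excludes any port or credentials. The helper name same_host is made up for this sketch, and it assumes that URLs without a host should never compare equal.

```python
from urllib.parse import urlparse


def same_host(a: str, b: str) -> bool:
    """Hypothetical helper: compare two URLs by host only."""
    host_a = urlparse(a).hostname  # lowercased, without port; None if absent
    host_b = urlparse(b).hostname
    return host_a is not None and host_a == host_b


print(same_host("https://hello.com?test=test", "https://HELLO.com:8080/path"))  # True
print(same_host("https://hello.com", "https://example.com"))                    # False
```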
