I want to get just the target URL from the link, without following the redirect. My code is:
from bs4 import BeautifulSoup
html = '<a href="/biz_redir?url=https://aceplumbingandrooter.com&cachebuster=1642876680&website_link_type=website&src_bizid=hqjCHBGnEj4nECnLJBvjQw&s=2caa69aa7350cca9ad00f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f" rel="noopener nofollow" role="link" target="_blank">https://aceplumbingandrooter.c…</a>'
soup = BeautifulSoup(html, 'lxml')
The href attribute of the a tag contains:
href="/biz_redir?url=https://aceplumbingandrooter.com&cachebuster=1642876680&website_link_type=website&src_bizid=hqjCHBGnEj4nECnLJBvjQw&s=2caa69aa7350cca9ad00f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f"
I want just the link URL: aceplumbingandrooter.com
CodePudding user response:
You can use the urllib.parse module from the standard library. The URL you are looking for is one of the query parameters of the /biz_redir link, so we first need to extract the 'url' parameter from it.
from urllib.parse import urlparse, parse_qs
url = '/biz_redir?url=https://aceplumbingandrooter.com&' \
'cachebuster=1642876680&website_link_type=website&' \
'src_bizid=hqjCHBGnEj4nECnLJBvjQw&s=2caa69aa7350cca9ad00' \
'f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f'
parsed_url = urlparse(url)
print(parse_qs(parsed_url.query)['url'][0])
This gives you the full URL, https://aceplumbingandrooter.com. You can then parse that URL again and take its netloc. Here is the complete code:
from urllib.parse import urlparse, parse_qs
url = '/biz_redir?url=https://aceplumbingandrooter.com&' \
'cachebuster=1642876680&website_link_type=website&' \
'src_bizid=hqjCHBGnEj4nECnLJBvjQw&s=2caa69aa7350cca9ad00' \
'f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f'
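parsed_url = urlparse(url)
# parse_qs maps each parameter name to a list of values; take the first 'url' value
target = parse_qs(parsed_url.query)['url'][0]
# Parse the extracted URL again and keep only the host part
print(urlparse(target).netloc)
Output: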
aceplumbingandrooter.com
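If you need to do this for many links, the same two steps could be wrapped in a small helper. The following is only a sketch: the function name extract_target_host and its param argument are made up for illustration, and it assumes the href string has already been read from the tag (for example with soup.a['href']). It returns None when the parameter is missing instead of raising a KeyError. Note that parse_qs also decodes percent-encoded values, so a link such as url=https%3A%2F%2F... should work the same way.

```python
from urllib.parse import urlparse, parse_qs


def extract_target_host(href, param="url"):
    """Return the host of the URL stored in a redirect link's query string.

    Hypothetical helper: returns None if the parameter is absent or has no host.
    """
    # parse_qs returns a dict mapping each parameter name to a list of values
    values = parse_qs(urlparse(href).query).get(param)
    if not values:
        return None
    # Parse the embedded URL and keep only the network location (host)
    return urlparse(values[0]).netloc or None


href = ("/biz_redir?url=https://aceplumbingandrooter.com"
        "&cachebuster=1642876680&website_link_type=website")
print(extract_target_host(href))                        # aceplumbingandrooter.com
print(extract_target_host("/biz_redir?cachebuster=1"))  # None
```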
