Get the URL from an href with BeautifulSoup, without the redirect link

I want to get just the target URL from a link, without the redirect wrapper around it. My code is:

from bs4 import BeautifulSoup

html = '<a  href="/biz_redir?url=https://aceplumbingandrooter.com&amp;cachebuster=1642876680&amp;website_link_type=website&amp;src_bizid=hqjCHBGnEj4nECnLJBvjQw&amp;s=2caa69aa7350cca9ad00f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f" rel="noopener nofollow" role="link" target="_blank">https://aceplumbingandrooter.c…</a>'

soup = BeautifulSoup(html, 'lxml')

The href attribute of the a tag contains:

href="/biz_redir?url=https://aceplumbingandrooter.com&amp;cachebuster=1642876680&amp;website_link_type=website&amp;src_bizid=hqjCHBGnEj4nECnLJBvjQw&amp;s=2caa69aa7350cca9ad00f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f"

I want just the target URL: aceplumbingandrooter.com

Answer:

You can use the standard-library urllib.parse module. The URL you are looking for is passed as the 'url' query parameter of /biz_redir, so first extract that parameter:

from urllib.parse import urlparse, parse_qs

# href as returned by soup.a['href']: BeautifulSoup has already
# decoded each &amp; in the HTML source to a plain &
url = '/biz_redir?url=https://aceplumbingandrooter.com&' \
      'cachebuster=1642876680&website_link_type=website&' \
      'src_bizid=hqjCHBGnEj4nECnLJBvjQw&s=2caa69aa7350cca9ad00' \
      'f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f'

parsed_url = urlparse(url)
# parse_qs maps each query parameter to a list of its values
print(parse_qs(parsed_url.query)['url'][0])

This gives you the full URL, https://aceplumbingandrooter.com. You can then parse that URL again and read its netloc (the host name). Here is the complete code:

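from urllib.parse import urlparse, parse_qs

# href as returned by soup.a['href']: BeautifulSoup has already
# decoded each &amp; in the HTML source to a plain &
url = '/biz_redir?url=https://aceplumbingandrooter.com&' \
      'cachebuster=1642876680&website_link_type=website&' \
      'src_bizid=hqjCHBGnEj4nECnLJBvjQw&s=2caa69aa7350cca9ad00' \
      'f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f'

parsed_url = urlparse(url)
# Pull the wrapped URL out of the 'url' query parameter
target = parse_qs(parsed_url.query)['url'][0]
# Parse the wrapped URL itself and keep only its host name
print(urlparse(target).netloc)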

Output:

aceplumbingandrooter.com
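
If you need this for many links, the two steps can be wrapped in small helpers. This is only a sketch using the standard library: the function names redirect_target and redirect_host are made up for illustration, and it assumes the href has the same shape as the one in the question (a 'url' query parameter holding the real address). Links without that parameter return None instead of raising a KeyError.

```python
from urllib.parse import urlparse, parse_qs


def redirect_target(href, param="url"):
    """Return the URL stored in the `param` query parameter, or None.

    `redirect_target` is a hypothetical helper name, not a library function.
    """
    query = urlparse(href).query
    # parse_qs returns a dict of lists; .get avoids a KeyError when the
    # parameter is missing (e.g. for an ordinary, non-redirect link)
    values = parse_qs(query).get(param)
    return values[0] if values else None


def redirect_host(href, param="url"):
    """Return only the host name (netloc) of the wrapped URL, or None."""
    target = redirect_target(href, param)
    return urlparse(target).netloc if target else None


# Shortened version of the href from the question, written the way
# BeautifulSoup returns it (with & rather than &amp;)
href = ("/biz_redir?url=https://aceplumbingandrooter.com&"
        "cachebuster=1642876680&website_link_type=website")

print(redirect_target(href))
print(redirect_host(href))
```

With the soup object from the question, you would pass soup.a['href'] to either helper.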