How do I fix Python code to extract full links from a webpage? The available code extracts only partial links

Time:01-05

I am a beginner with Python and am using BeautifulSoup to extract links from the following webpage: https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital. All the code I have found looks like the following:

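import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital")
soup = BeautifulSoup(html_page, "html.parser")
# print the raw href attribute of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))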

The output includes partial links such as "/providers", which should be "https://mhealthfairview.org/providers". Is there any way to extract the full link rather than the partial one? Thank you.

CodePudding user response:

Use urllib.parse.urljoin, which resolves each href against the URL of the page:

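import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital"
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a'):
    # urljoin resolves relative hrefs such as "/providers" against url
    print(urljoin(url, link.get('href')))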
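
To see what urljoin does with different kinds of href without fetching the page, here is a small offline sketch. The href values in the list are made-up examples, not links taken from the actual page; only the base URL comes from the question.

```python
from urllib.parse import urljoin

base = "https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital"

# Hypothetical href values covering the common cases
examples = [
    "/providers",                # root-relative path
    "https://example.com/page",  # already absolute, left unchanged
    "#main",                     # fragment on the current page
    "services",                  # relative to the current "directory"
    "//cdn.example.com/x.js",    # scheme-relative, inherits https
]

# Map each raw href to the URL urljoin resolves it to
resolved = {href: urljoin(base, href) for href in examples}
for href, full in resolved.items():
    print(f"{href!r:30} -> {full}")
```

Note that "services" resolves to https://mhealthfairview.org/locations/services, because the base URL has no trailing slash and its last path segment is replaced.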

CodePudding user response:

You can simply use an if statement inside the loop to prepend the site root to any href that starts with a slash.

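webroot = 'https://mhealthfairview.org'
for link in soup.find_all('a'):
    href = link.get('href')
    # prepend the site root only to root-relative hrefs such as "/providers"
    if href and href[0] == "/":
        print(webroot + href)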
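
A plain check on the first character fails when an a tag has no href (link.get('href') returns None) or an empty one, and it does not cover relative paths without a leading slash. As a rough sketch of a more defensive variant, the helper below combines a guard with urljoin. The name to_absolute and the sample href list are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin, urlparse

def to_absolute(base_url, href):
    """Return an absolute http(s) URL for href, or None if it is not a web link."""
    if not href:  # covers None (no href attribute) and empty strings
        return None
    full = urljoin(base_url, href.strip())
    if urlparse(full).scheme not in ("http", "https"):
        return None  # skip non-web schemes such as mailto:
    return full

base = "https://mhealthfairview.org/locations/m-health-fairview-st-johns-hospital"
# Hypothetical values of link.get('href'), including the awkward cases
hrefs = ["/providers", None, "", "mailto:info@example.com", "https://mhealthfairview.org/"]
links = [u for u in (to_absolute(base, h) for h in hrefs) if u is not None]
print(links)
```

In the scraping loop this would be called as to_absolute(url, link.get('href')), printing only the results that are not None.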