How can BeautifulSoup scrape the pages inside this list of hyperlinks?

Time:01-24

I am trying to scrape the contents of the pages behind the hyperlinks on the left side of this page. I can already scrape a single address page, so now I want to run the script on each individual hyperlink on the left side of the page.

URL: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-3.html

I think the URL needs to be a dynamic variable, set by a loop that goes through all of the hyperlinks on the page above. I'm not sure this is the best way to approach it, though, as this is my first project.

Any advice is greatly appreciated.

Here is the code that I am trying to plug this into.

import csv
import requests
from bs4 import BeautifulSoup as bs

url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url, headers=headers)

soup = bs(r.content, 'lxml')
table = soup.find(id="table_maina")
headers = []
datarows = []


#Get crypto address for the filename
item = soup.find('h1').text
newitem = item.replace('Dogecoin','')
finalitem = newitem.replace('Address','').strip()  # strip stray whitespace so the filename is clean



for row in table.find_all('tr'):
    heads = row.find_all('th')
    if heads:
        headers = [th.text for th in heads]
    else:
        datarows.append([td.text for td in row.find_all('td')])

# Use a context manager so the file is flushed and closed
with open(f'{finalitem}.csv', 'w', newline='') as f:
    fcsv = csv.writer(f)
    fcsv.writerow(headers)
    fcsv.writerows(datarows)

CodePudding user response:

A simple way would be to make an initial request and extract all the links in the second column of the table.

Then loop over those links, request each one, and continue with your existing code, while also handling the case where no table is present.

import csv
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    s.headers.update({"User-Agent": "Safari/537.36"})
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-3.html')
    soup = bs(r.content, 'lxml')
    # Address links sit in the second column of the table
    address_links = [i['href'] for i in soup.select('.table td:nth-child(2) > a')]

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

        if table:
            # Get crypto address for the filename
            item = soup.find('h1').text
            newitem = item.replace('Dogecoin', '')
            finalitem = newitem.replace('Address', '').strip()

            # Start with empty lists for every address so rows from
            # earlier pages do not end up in this file
            headers = []
            datarows = []

            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            # Use a context manager so each file is flushed and closed
            with open(f'{finalitem}.csv', 'w', newline='') as f:
                fcsv = csv.writer(f)
                fcsv.writerow(headers)
                fcsv.writerows(datarows)
        else:
            print('no table for: ', url)
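
If you want the loop to be a bit more defensive, two small helpers may be useful. This is only a sketch with standard-library code: `address_from_heading` and `absolute_links` are hypothetical names, not part of BeautifulSoup or requests. It assumes the `h1` text looks like "Dogecoin Address <address>", as the `replace` calls above imply, and that some `href` values might be relative or repeated, which I have not verified for this site.

```python
import re
from urllib.parse import urljoin

LIST_URL = "https://bitinfocharts.com/top-100-richest-dogecoin-addresses-3.html"


def address_from_heading(heading):
    """Drop the 'Dogecoin' and 'Address' words and trim whitespace."""
    cleaned = heading.replace("Dogecoin", "").replace("Address", "").strip()
    # Replace runs of anything that is not a letter, digit, dash or
    # underscore, so the result is safe to use as a filename.
    return re.sub(r"[^A-Za-z0-9_-]+", "_", cleaned)


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url and drop duplicates, keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        full = urljoin(base_url, href)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result
```

In the loop above you could then write `for url in absolute_links(LIST_URL, address_links):` and build the filename with `address_from_heading(soup.find('h1').text)`.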