BeautifulSoup Scraping Elements Containing Certain Date

I am using BeautifulSoup to scrape webpages from this URL: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html

I can already scrape the web pages behind the hyperlinks on the left side, but now I want to add a filter for which pages get scraped. The filter I am working with is the "Last Out" date on the right side: I only want to scrape pages whose Last Out falls after a certain date. For example, only scrape pages with a Last Out after 1-1-2020.

I think this needs an if statement: if the date is later than 1-1-2020, the script goes on to scrape the respective hyperlink. I am not really sure, though, or whether this is even possible with Beautiful Soup.

I appreciate any help, ideas, or advice.

import csv
import requests
from bs4 import BeautifulSoup as bs

headers = []
datarows = []

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    address_links = [i['href'] for i in soup.select('.table td:nth-child(2) > a')]
    
    for url in address_links:

        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")
        
        if table:
            item = soup.find('h1').text
            newitem = item.replace('Dogecoin','')
            finalitem = newitem.replace('Address','')

            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            # use a context manager so the file is closed after writing
            with open(f'{finalitem}.csv', 'w', newline='') as f:
                fcsv = csv.writer(f)
                fcsv.writerow(headers)
                fcsv.writerows(datarows)

CodePudding user response：

Using the datetime library is a good way to do this, since it makes date/time comparison easy. I was able to implement it in your code and left some comments to explain it:

import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []
# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')

    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # parse the date string into a datetime object for comparison
            # (the format is year-month-day hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # keep the link only if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)

    for url in address_links:

        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

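        if table:
            item = soup.find('h1').text
            newitem = item.replace('Dogecoin', '')
            # strip leftover whitespace so the file name is clean
            finalitem = newitem.replace('Address', '').strip()

            # reset the rows so each CSV only contains this address's data
            datarows = []
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            # use a context manager so the file is closed after writing
            with open(f'{finalitem}.csv', 'w', newline='') as f:
                fcsv = csv.writer(f)
                fcsv.writerow(headers)
                fcsv.writerows(datarows)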

Leave a comment if you have any questions about how it works that my code comments didn't answer. I'd be happy to answer them!

CodePudding user response：

You are correct: you need to do a date comparison, but to do that you first need to convert the date from a string to a datetime object. Have a look at the datetime module, specifically the strptime() method, which converts a string to a datetime object.
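
To make the strptime() step concrete, here is a minimal sketch of just the conversion and comparison, separate from the scraping. It assumes the "Last Out" strings look like "2021-08-05 12:08:29 UTC"; the helper name is_after_cutoff and the sample strings are made up for illustration and are not taken from the page.

```python
from datetime import datetime

# Cutoff date 1-1-2020 as a (naive) datetime object
AFTER_DATE = datetime(2020, 1, 1)

# Assumed layout of the date string: year-month-day hour:minute:second timezone
DATE_FORMAT = "%Y-%m-%d %H:%M:%S %Z"


def is_after_cutoff(last_out_str, cutoff=AFTER_DATE):
    """Return True if the date string parses and is later than the cutoff.

    Hypothetical helper: empty or unparseable strings count as "no match".
    """
    text = last_out_str.strip()
    if not text:
        return False
    try:
        parsed = datetime.strptime(text, DATE_FORMAT)
    except ValueError:
        # the string did not match the assumed format
        return False
    return parsed > cutoff


print(is_after_cutoff("2021-08-05 12:08:29 UTC"))
print(is_after_cutoff("2019-12-31 23:59:59 UTC"))
print(is_after_cutoff(""))
```

With something like this in place, the if statement from the question reduces to calling the helper on each row's date text before appending the link. Treating unparseable dates as "no match" is a design choice; you may prefer to log or raise instead so that a changed page layout does not silently skip rows.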