I am using BeautifulSoup to scrape webpages from this URL: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
I am able to scrape the web pages behind the hyperlinks on the left side, but now I am trying to add some parameters for which pages I scrape. The parameter I am working with is the "Last Out" date on the right side. Basically, I am trying to scrape only web pages whose Last Out is after a certain date. For example, only scrape pages that have a Last Out after 1-1-2020.
What I think needs to be done is to add an if statement: if the date is later than 1-1-2020, the code continues on to scrape the respective hyperlink. I am not really sure though, or whether it's even possible to do this with Beautiful Soup.
I appreciate any help, ideas, or advice.
import csv
import requests
from bs4 import BeautifulSoup as bs

headers = []
datarows = []

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    address_links = [i['href'] for i in soup.select('.table td:nth-child(2) > a')]

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

        if table:
            item = soup.find('h1').text
            newitem = item.replace('Dogecoin','')
            finalitem = newitem.replace('Address','')

            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)
CodePudding user response:
Using the datetime library is the best way to do this, since it allows for easy date/time comparison. I was able to implement it in your code. I left some comments to explain the code:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []

# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')

    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

        if table:
            item = soup.find('h1').text
            newitem = item.replace('Dogecoin', '')
            finalitem = newitem.replace('Address', '')

            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)
Leave a comment if you have any questions about how it works that my comments didn't answer. I'd be happy to answer them!
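One possible refinement, offered as a sketch rather than a tested change to the code above: datarows is created once and never cleared, so as written each CSV would also include the rows collected for earlier addresses, and the file opened for csv.writer is never explicitly closed. The helper below (write_address_csv and its out_dir parameter are hypothetical names, not part of the original code) writes a single address's rows inside a with block. The idea would be to build a fresh list of rows for each URL and pass it in.

```python
import csv
from pathlib import Path


def write_address_csv(name, headers, rows, out_dir="."):
    """Write one address's table to <out_dir>/<name>.csv and return the path."""
    # strip() removes the whitespace left over after the replace() calls on the h1 text
    path = Path(out_dir) / f"{name.strip()}.csv"
    # the with block closes the file even if writing fails partway through
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    return path
```

Inside the loop over address_links, this would be called as write_address_csv(finalitem, headers, rows), where rows is a list created anew for each URL instead of the shared datarows.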
CodePudding user response:
You are correct: you need to do a date comparison, but in order to do that you need to convert the date from a string to a datetime object. Have a look at the datetime module, and specifically the strptime() method, to convert a string to a datetime object.
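As a minimal sketch of that conversion and comparison: the format string below assumes the "Last Out" text looks like "2021-05-04 12:30:00 UTC", and the sample strings are made up for illustration, so adjust both to whatever the page actually contains. The helper name is_after is hypothetical.

```python
from datetime import datetime

# Assumed layout of the "Last Out" text: year-month-day hour:minute:second timezone
LAST_OUT_FORMAT = "%Y-%m-%d %H:%M:%S %Z"


def is_after(last_out_str, cutoff):
    """Return True if last_out_str parses to a datetime later than cutoff."""
    text = last_out_str.strip()
    if not text:
        return False  # empty "Last Out" cell, nothing to compare
    try:
        last_out = datetime.strptime(text, LAST_OUT_FORMAT)
    except ValueError:
        return False  # text did not match the assumed format, so skip it
    return last_out > cutoff


cutoff = datetime(2020, 1, 1)
print(is_after("2021-05-04 12:30:00 UTC", cutoff))  # made-up sample after the cutoff
print(is_after("2019-12-31 23:59:59 UTC", cutoff))  # made-up sample before the cutoff
print(is_after("", cutoff))                         # empty cell
```

Returning False on a parse failure is a design choice for this sketch: it skips rows with unexpected text instead of stopping the whole scrape. Raising the error instead would make format changes on the site easier to notice.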
