I have a BeautifulSoup script which scrapes the pages inside the hyperlinks on this page: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
My goal is to save a CSV file for each page, using the webpage title as the file name. The title is the crypto address of the page the data was gathered from.
For example, this web page: https://bitinfocharts.com/dogecoin/address/DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX
Would be saved as "DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX.csv"
To use the webpage title as the CSV name, I read the title from the page, strip everything except the address, and assign the result to a variable called filename.
This is the code that creates the filename:
ad2 = (soup.title.string)
ad2 = ad2.replace('Dogecoin', '')
ad2 = ad2.replace('Address', '')
ad2 = ad2.replace('-', '')
filename = ad2.replace(' ', '')
When the CSV is written, its contents do not match the address in its filename.
For example, when the script saves a file named "DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX.csv", the data in it is not the data from the https://bitinfocharts.com/dogecoin/address/DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX web page.
I think the script is reading the wrong web page title, so the CSV is created with the wrong filename. This is the full script:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime
headers = []
datarows = []
# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)
with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)
    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")
        #Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)
            # Compare profit to goal
            goal = float(50000)
            if profit < goal:
                continue
            if table:
                ad2 = (soup.title.string)
                ad2 = ad2.replace('Dogecoin', '')
                ad2 = ad2.replace('Address', '')
                ad2 = ad2.replace('-', '')
                filename = ad2.replace(' ', '')
                for row in table.find_all('tr'):
                    heads = row.find_all('th')
                    if heads:
                        headers = [th.text for th in heads]
                    else:
                        datarows.append([td.text for td in row.find_all('td')])
                    fcsv = csv.writer(open(f'{filename}.csv', 'w', newline=''))
                    fcsv.writerow(headers)
                    fcsv.writerows(datarows)
Any help is greatly appreciated. Thank you.
Answer:
You're reopening the file in 'w' mode every time through the for loop, which empties the file and loses what you wrote on the previous iterations.
You should open the file once before the loop so you can write everything.
Also, you should initialize datarows to an empty list when processing each file. Otherwise you're combining the rows of all the pages you're scraping.
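The first point can be seen in isolation with a short sketch that uses only the standard library. It assumes nothing about the scraper: the file name demo.csv and the sample rows are made up for illustration. Opening a file in 'w' mode truncates it, so when the open call sits inside the loop only the last write survives.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for scraped table rows.
rows = [["a", "1"], ["b", "2"], ["c", "3"]]

path = os.path.join(tempfile.mkdtemp(), "demo.csv")  # made-up file name

# Reopening in 'w' mode on every iteration truncates the file each time.
for row in rows:
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(row)

with open(path, newline="") as f:
    reopened = list(csv.reader(f))

# Opening once and writing inside the block keeps every row.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

with open(path, newline="") as f:
    opened_once = list(csv.reader(f))

print(reopened)     # only the last row is left
print(opened_once)  # all three rows are present
```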
if table:
    ad2 = (soup.title.string)
    ad2 = ad2.replace('Dogecoin', '')
    ad2 = ad2.replace('Address', '')
    ad2 = ad2.replace('-', '')
    filename = ad2.replace(' ', '')
    with open(f'{filename}.csv', 'w', newline='') as f:
        fcsv = csv.writer(f)
        datarows = []
        for row in table.find_all('tr'):
            heads = row.find_all('th')
            if heads:
                headers = [th.text for th in heads]
            else:
                datarows.append([td.text for td in row.find_all('td')])
        fcsv.writerow(headers)
        fcsv.writerows(datarows)
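Both points can also be combined in a small helper that builds a fresh row list for each page and opens that page's file exactly once. This is a sketch under assumptions, not the original script: write_page_csv and address_from_url are hypothetical names, the rows are passed in as plain lists rather than parsed with BeautifulSoup, and the example.com URLs and addresses are placeholders. Taking the file name from the last segment of the URL is an alternative to cleaning up the title; it relies on the URL ending in the address, as in the example link in the question.

```python
import csv
import os
import tempfile


def address_from_url(url):
    """Return the last path segment of the URL (hypothetical helper)."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def write_page_csv(out_dir, url, table_rows):
    """Write one page's rows to <address>.csv and return the file path.

    table_rows is assumed to be a list of (is_header, cells) pairs,
    standing in for the <tr> elements of one page's table.
    """
    headers = []
    datarows = []  # fresh list per page, so rows from other pages never mix in
    for is_header, cells in table_rows:
        if is_header:
            headers = cells
        else:
            datarows.append(cells)

    path = os.path.join(out_dir, address_from_url(url) + ".csv")
    # Open once, after all rows for this page have been collected.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(datarows)
    return path


# Made-up data for two pages; the URLs and addresses are placeholders.
pages = {
    "https://example.com/dogecoin/address/AAA": [
        (True, ["Block", "Amount"]),
        (False, ["1", "10"]),
    ],
    "https://example.com/dogecoin/address/BBB": [
        (True, ["Block", "Amount"]),
        (False, ["2", "20"]),
        (False, ["3", "30"]),
    ],
}

out_dir = tempfile.mkdtemp()
written = {}
for url, table_rows in pages.items():
    path = write_page_csv(out_dir, url, table_rows)
    with open(path, newline="") as f:
        written[os.path.basename(path)] = list(csv.reader(f))
```

In the scraper itself, the (is_header, cells) pairs would come from the table.find_all('tr') loop, with the th and td texts playing the role of cells.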
