I am saving BeautifulSoup results to CSV file with webpage title as filename, but the filename isn&#

Time:02-03

I have a BeautifulSoup script which scrapes the pages inside the hyperlinks on this page: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html

My goal is to save each CSV file under the webpage title, which is the crypto address of the page the data was gathered from.

For example, this web page: https://bitinfocharts.com/dogecoin/address/DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX

Would be saved as "DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX.csv"

To save the webpage title as the csv name, I am using a piece of code which gathers the title from the webpage, and assigns it to a variable called filename.

This is my code which creates the filename:

    ad2 = (soup.title.string)
    ad2 = ad2.replace('Dogecoin', '')
    ad2 = ad2.replace('Address', '')
    ad2 = ad2.replace('-', '')
    filename = ad2.replace(' ', '')

When the CSV is written, its contents do not match the address in its filename.

For example, when the script runs and saves the csv name as "DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX.csv", the data in the CSV is not the correct data for the https://bitinfocharts.com/dogecoin/address/DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX web page.

I think the script is reading the wrong web page title, so the CSV is created with an incorrect filename. This is the full script:

import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []
# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')

    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)

    for url in address_links:

        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")


        #Get the profit

        sections = soup.find_all(class_='table-striped')

        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)

        # Compare profit to goal

        goal = float(50000)

        if profit < goal:
            continue

        if table:

            ad2 = (soup.title.string)
            ad2 = ad2.replace('Dogecoin', '')
            ad2 = ad2.replace('Address', '')
            ad2 = ad2.replace('-', '')
            filename = ad2.replace(' ', '')

            for row in table.find_all('tr'):
                        heads = row.find_all('th')
                        if heads:
                            headers = [th.text for th in heads]
                        else:
                            datarows.append([td.text for td in row.find_all('td')])

                        fcsv = csv.writer(open(f'{filename}.csv', 'w', newline=''))
                        fcsv.writerow(headers)
                        fcsv.writerows(datarows)

Any help is greatly appreciated. Thank you.

CodePudding user response:

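You're reopening the file on every pass through the inner for loop. Opening in 'w' mode empties the file each time, so whatever was written on previous iterations is thrown away and has to be written again, and the file handle is never closed.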

You should open the file once before the loop so you can write everything.
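To make the effect of 'w' mode concrete, here is a minimal sketch with made-up rows. The helper names (`write_reopening`, `write_once`, `read_rows`) are hypothetical and not part of the question's script; the first helper assumes each pass writes only its own row, which is the simplest way to see the truncation.

```python
import csv
import os
import tempfile

def write_reopening(path, rows):
    """Reopen the file in 'w' mode for every row (the problematic pattern)."""
    for row in rows:
        with open(path, "w", newline="") as f:  # 'w' truncates on each open
            csv.writer(f).writerow(row)

def write_once(path, rows):
    """Open the file a single time and write every row through one writer."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def read_rows(path):
    """Read a CSV file back as a list of rows."""
    with open(path, newline="") as f:
        return list(csv.reader(f))

rows = [["a", "1"], ["b", "2"], ["c", "3"]]  # made-up sample data
tmpdir = tempfile.mkdtemp()
reopened = os.path.join(tmpdir, "reopened.csv")
once = os.path.join(tmpdir, "once.csv")

write_reopening(reopened, rows)
write_once(once, rows)

print(read_rows(reopened))  # only the last row survives
print(read_rows(once))      # all three rows are present
```
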

Also, you should reset datarows to an empty list for each page. Otherwise every CSV also contains the rows of all the pages scraped before it, which is why the data doesn't match the filename.

if table:
    # Derive the filename from the page title (the address)
    ad2 = soup.title.string
    ad2 = ad2.replace('Dogecoin', '')
    ad2 = ad2.replace('Address', '')
    ad2 = ad2.replace('-', '')
    filename = ad2.replace(' ', '')

    # Start with fresh lists so rows from earlier pages don't leak into this file
    headers = []
    datarows = []
    for row in table.find_all('tr'):
        heads = row.find_all('th')
        if heads:
            headers = [th.text for th in heads]
        else:
            datarows.append([td.text for td in row.find_all('td')])

    # Open the file once per page; the with block closes it after writing
    with open(f'{filename}.csv', 'w', newline='') as f:
        fcsv = csv.writer(f)
        fcsv.writerow(headers)
        fcsv.writerows(datarows)
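
If you want to check the per-page logic without hitting the site, something like the following self-contained sketch may help. It uses made-up titles and rows in place of scraped pages, and the helpers `title_to_filename` and `write_page_csv` are hypothetical names introduced only for this illustration. The title format ("Dogecoin Address ...") is an assumption based on the replacements in the question.

```python
import csv
import os
import tempfile

def title_to_filename(title):
    """Remove the same words and separators the question strips from the title."""
    for token in ("Dogecoin", "Address", "-", " "):
        title = title.replace(token, "")
    return title

def write_page_csv(directory, title, headers, rows):
    """Write one page's table to its own CSV file and return the file path."""
    path = os.path.join(directory, f"{title_to_filename(title)}.csv")
    with open(path, "w", newline="") as f:  # opened once per page
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    return path

# Made-up stand-ins for two scraped pages: (title, headers, rows).
pages = [
    ("Dogecoin Address AAA111", ["Block", "Amount"], [["1", "10"], ["2", "20"]]),
    ("Dogecoin Address BBB222", ["Block", "Amount"], [["3", "30"]]),
]

outdir = tempfile.mkdtemp()
paths = []
for title, headers, rows in pages:
    datarows = list(rows)  # a fresh list for every page, never shared
    paths.append(write_page_csv(outdir, title, headers, datarows))

for path in paths:
    print(os.path.basename(path))
```

Each file then holds only the rows of its own page, because nothing is carried over between iterations of the outer loop.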