How can I use BeautifulSoup to scrape this table?

I am new to Python and learning data analysis. I am trying to scrape data from this web page: https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7

I am able to scrape data from simple websites, but since BitInfoCharts uses tables, I think its HTML may be more complex than what the tutorials I am following cover.

My goal is to scrape the data from the table, which includes Block, Time, Amount, Balance, etc., and save it to a CSV file. I previously tried using pandas but found it difficult to select the data I want from the HTML.

To do this, I think I need to get the header/table information from the table with id "table_maina" and then pull all of the information from each row inside it that has class "trb". The number of class=trb rows changes from page to page (for example, one address may have 7 transactions and another may have 40). I am not entirely sure, though, as this is new territory for me.

I would really appreciate any help.

import requests
from bs4 import BeautifulSoup as bs 
url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
headers = {"User-Agent":"Mozilla/5.0"}

r = requests.get(url, headers=headers)

soup = bs(r.content)

table = soup.find_all("table_maina")
print(table)

CodePudding user response：

If you decide to parse the table manually with BeautifulSoup rather than pandas, this collects the header and the data rows and writes them to a CSV file:

import csv
import requests
from bs4 import BeautifulSoup as bs 
url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
headers = {"User-Agent":"Mozilla/5.0"}

r = requests.get(url, headers=headers)

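soup = bs(r.content, 'lxml')
table = soup.find(id="table_maina")

# Separate the header row (th cells) from the data rows (td cells).
# header_row is named so it does not shadow the request headers dict above.
header_row = []
datarows = []
for row in table.find_all('tr'):
    heads = row.find_all('th')
    if heads:
        header_row = [th.text for th in heads]
    else:
        datarows.append([td.text for td in row.find_all('td')])

# Write the header, then the data rows; the with block closes the file.
with open('x.csv', 'w', newline='') as f:
    fcsv = csv.writer(f)
    fcsv.writerow(header_row)
    fcsv.writerows(datarows)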
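
For trying the header/row split without hitting the live site, here is a minimal offline sketch. It assumes made-up sample markup that only mimics the layout described in the question; SAMPLE_HTML and table_to_rows are hypothetical names, and the real page may be structured differently. It uses get_text(strip=True) instead of .text so surrounding whitespace is dropped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the id "table_maina" and the class "trb"
# come from the question, the rest is an assumption for illustration.
SAMPLE_HTML = """
<table id="table_maina">
  <tr><th>Block</th><th>Time</th><th>Amount</th></tr>
  <tr class="trb"><td>4066317</td><td>2022-01-17 15:41:22 UTC</td><td>-33,000,000 DOGE</td></tr>
  <tr class="trb"><td>4063353</td><td>2022-01-15 11:04:46 UTC</td><td>4,000,000 DOGE</td></tr>
</table>
"""


def table_to_rows(html, table_id):
    """Return (header, rows) for the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # find() returns None when nothing matches, e.g. if the site
        # served a different page than expected.
        raise ValueError(f"no table with id {table_id!r}")
    header, rows = [], []
    for tr in table.find_all("tr"):
        ths = tr.find_all("th")
        if ths:
            header = [th.get_text(strip=True) for th in ths]
        else:
            rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return header, rows


header, rows = table_to_rows(SAMPLE_HTML, "table_maina")
print(header)
print(rows[0])
```

The explicit None check is a design choice: without it, a missing table only shows up later as an AttributeError on table.find_all.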

CodePudding user response：

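Only one element has the id 'table_maina', so you should call find() rather than find_all(). Also, the first positional argument is the tag name, so pass "table" there and give the id as a keyword argument; find_all("table_maina") searches for a tag named table_maina, which does not exist.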

Try:

table = soup.find('table', id='table_maina')
for tr in table.find_all('tr', class_='trb'):
  print(tr.text)

Output:

4066317 2022-01-17 15:41:22 UTC2022-01-17 15:41:22 UTC-33,000,000 DOGE (5,524,731.65 USD)220,000,005.04121223 DOGE$36,831,545 @ $0.167$-28,974,248
4063353 2022-01-15 11:04:46 UTC2022-01-15 11:04:46 UTC 4,000,000 DOGE (759,634.87 USD)253,000,005.04121223 DOGE$48,046,907 @ $0.19$-23,283,618
...
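
To see why the original lookup came back empty, here is a small sketch on an inline snippet. The markup is an assumption for illustration; only the id table_maina and the class trb are taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = (
    '<table id="table_maina">'
    '<tr class="trb"><td>1</td></tr>'
    '<tr class="trb"><td>2</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")

# A bare string is treated as a tag name, so this looks for <table_maina>.
by_tag_name = soup.find_all("table_maina")

# Tag name first, then the id attribute as a keyword argument.
by_id = soup.find("table", id="table_maina")

# class is a reserved word in Python, hence the class_ keyword.
rows = by_id.find_all("tr", class_="trb")

print(len(by_tag_name))  # number of <table_maina> tags found
print(by_id.name)        # tag name of the matched element
print(len(rows))         # number of rows with class "trb"
```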

Next, to write each row to a CSV file, try this:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

table = soup.find("table", id='table_maina')
with open('out.csv', 'w', newline='') as fout:
    csv_writer = csv.writer(fout)
    csv_writer.writerow(['Block', 'Time', 'Amount', 'Balance', 'Price', 'Profit'])
    for tr in table.find_all('tr', class_='trb'):
        tds = tr.find_all('td')
        csv_writer.writerow([x.text for x in tds])

Output:

Block,Time,Amount,Balance,Price,Profit
4066317 2022-01-17 15:41:22 UTC,2022-01-17 15:41:22 UTC,"-33,000,000 DOGE (5,524,731.65 USD)","220,000,005.04121223 DOGE","$36,831,545 @ $0.167","$-28,974,248"
...
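
Since the goal is data analysis, the Amount cells will probably need converting to numbers at some point. Here is a rough sketch, assuming every cell follows the "-33,000,000 DOGE (5,524,731.65 USD)" pattern seen in the output above; AMOUNT_RE and parse_amount are hypothetical names, and other pages or coins may format the cell differently.

```python
import re

# Assumed cell format: "<number> DOGE (<number> USD)", with thousands
# separators and an optional leading sign on either number.
AMOUNT_RE = re.compile(
    r"^\s*([+-]?[\d,]+(?:\.\d+)?)\s+DOGE\s+\(([+-]?[\d,]+(?:\.\d+)?)\s+USD\)\s*$"
)


def parse_amount(cell):
    """Return (doge, usd) as floats parsed from an Amount cell."""
    match = AMOUNT_RE.match(cell)
    if match is None:
        raise ValueError(f"unexpected Amount format: {cell!r}")
    # Drop the thousands separators before converting.
    doge, usd = (float(group.replace(",", "")) for group in match.groups())
    return doge, usd


print(parse_amount("-33,000,000 DOGE (5,524,731.65 USD)"))
print(parse_amount(" 4,000,000 DOGE (759,634.87 USD)"))
```

Raising on an unexpected format, rather than returning None, makes it obvious when a row does not match the assumed pattern.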