Trying to Scrape Table with Python's Beautifulsoup & Selenium

Time:01-14

As the title suggests, I am trying to scrape a table using both BeautifulSoup and Selenium. I'm aware I most likely do not need both libraries, but I wanted to see whether Selenium's XPath selectors would help. Unfortunately, they did not.

The website can be found here:

Table Data

Once I can grab the table, I will collect the td data inside the table rows.

So for example, I would want '29/Dec/2021' under 'Publication Date'. Unfortunately, I haven't been able to get this far because I can't grab the table.

Here is my code:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver

url = 'https://app.capitoltrades.com/politician/491'
resp = requests.get(url)
#soup = BeautifulSoup(resp.text, "html5lib")
soup = BeautifulSoup(resp.text, 'lxml')
table = soup.find("table", {"class": "p-datatable-table ng-star- inserted"}).findAll('tr')
print(table)

This yields the error message "AttributeError: 'NoneType' object has no attribute 'findAll'".

Using 'soup.findAll' also does not work.

If I try the XPath selector route using Selenium ...

DRIVER_PATH = '/Users/myname/Downloads/capitol-trades/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.implicitly_wait(1000)
driver.get('https://app.capitoltrades.com/politician/491')

table = driver.find_element_by_xpath("//*[@id='pr_id_2']/div/table").text
print(table)

Chrome opens up, but nothing gets printed inside my Jupyter notebook (probably because there is no text directly inside the table element?).

I'd prefer to be able to grab the table element using BeautifulSoup, but all answers are welcome. I appreciate any help you can provide.

CodePudding user response:

That site has a backend API that can be hit very easily:

import requests
import pandas as pd

url = 'https://api.capitoltrades.com/senators/trades/491/false?pageSize=20&pageNumber=1'
resp = requests.get(url).json()

df = pd.DataFrame(resp)
df.to_csv('naughty_nancy_trades.csv',index=False)
print('Saved to naughty_nancy_trades.csv ')

To see where the data comes from, open your browser's Developer Tools, go to Network > Fetch/XHR, and reload the page; you'll see the requests fire. I've scraped one of those network calls; there are others that supply the rest of the data on that page.
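
The URL above carries `pageSize` and `pageNumber` as query parameters, so rows beyond the first 20 can probably be collected by stepping through page numbers. A minimal sketch, assuming the path layout from that URL stays the same and that pages are numbered from 1; `build_trades_url` is a hypothetical helper, not part of any library:

```python
from urllib.parse import urlencode

API_ROOT = "https://api.capitoltrades.com/senators/trades"


def build_trades_url(politician_id, page_number=1, page_size=20):
    """Build the trades URL for one page (hypothetical helper).

    The path segments mirror the URL used in the answer; the meaning of
    the trailing 'false' flag is unknown, so it is copied unchanged.
    """
    query = urlencode({"pageSize": page_size, "pageNumber": page_number})
    return f"{API_ROOT}/{politician_id}/false?{query}"


# URLs for the first three pages of politician 491
urls = [build_trades_url(491, page_number=n) for n in range(1, 4)]
```

Each URL can then be passed to `requests.get(...).json()` as in the snippet above. How the API signals the last page is not known from this thread, so check the response before looping blindly.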

CodePudding user response:

As for your Selenium code: you are missing an explicit wait.
This

driver.find_element_by_xpath("//*[@id='pr_id_2']/div/table")

command returns the web element as soon as it has been created, i.e. when it already exists in the DOM but is not yet fully rendered.
This should work better:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

DRIVER_PATH = '/Users/myname/Downloads/capitol-trades/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
wait = WebDriverWait(driver, 20)

driver.get('https://app.capitoltrades.com/politician/491')

table = wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@id='pr_id_2']//tr[@class='p-selectable-row ng-star-inserted']"))).text
print(table)
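
Once the wait has succeeded, the rendered HTML is available as `driver.page_source` and can be handed to BeautifulSoup, which is what the question asked for. A rough sketch of the cell extraction follows. The HTML string is a made-up stand-in for the page source (only the class names, the 'Publication Date' header and the date come from this thread; the second column is invented), so the code runs without a browser:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source after the wait has succeeded.
rendered_html = """
<table class="p-datatable-table ng-star-inserted">
  <thead><tr><th>Publication Date</th><th>Type</th></tr></thead>
  <tbody>
    <tr class="p-selectable-row ng-star-inserted"><td>29/Dec/2021</td><td>Example</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
table = soup.select_one("table.p-datatable-table")

# Column names come from the header row; each body row becomes a dict.
headers = [th.get_text(strip=True) for th in table.select("thead th")]
rows = [
    dict(zip(headers, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in table.select("tbody tr")
]
print(rows[0]["Publication Date"])
```

In the real script, `rendered_html` would be replaced with `driver.page_source`, read after `wait.until(...)` has returned.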

As for your BS4 code, it looks like you are using the wrong locator.
This:

table = soup.find("table", {"class": "p-datatable-table ng-star-inserted"})

looks better (you have extra spaces in your class name).
The line above returns 5 elements.
So this should work:

table = soup.find("table", {"class": "p-datatable-table ng-star-inserted"}).findAll('tr')
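
The `AttributeError` in the question arises because `soup.find` returns `None` when nothing matches, and a multi-class string passed as `class` has to match the attribute value exactly. Below is a small sketch with made-up HTML showing the exact-string behaviour, a CSS selector that matches on a single class, and a `None` guard. It assumes the table is present in the HTML being parsed; if the page builds the table with JavaScript, the response from `requests` may not contain it at all.

```python
from bs4 import BeautifulSoup

# Made-up HTML; only the class names come from the thread.
html = '<table class="p-datatable-table ng-star-inserted"><tr><td>x</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# A stray space inside the class string means the exact-string match fails.
broken = soup.find("table", {"class": "p-datatable-table ng-star- inserted"})

# Matching on one class via a CSS selector ignores the other classes.
table = soup.select_one("table.p-datatable-table")

# Guard against None before calling methods on the result.
rows = table.find_all("tr") if table is not None else []
print(broken is None, len(rows))
```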