Trying to Scrape Table with Python's BeautifulSoup & Selenium

As the title suggests, I am trying to scrape a table using both BeautifulSoup and Selenium. I'm aware I most likely don't need both libraries, but I wanted to see whether Selenium's XPath selectors would help. Unfortunately, they did not.

The website can be found here:

Once I can grab the table, I will collect the td data inside the table rows.

For example, I would want '29/Dec/2021' under 'Publication Date'. Unfortunately, I haven't gotten that far because I can't grab the table.

Here is my code:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver

url = 'https://app.capitoltrades.com/politician/491'
resp = requests.get(url)
#soup = BeautifulSoup(resp.text, "html5lib")
soup = BeautifulSoup(resp.text, 'lxml')
table = soup.find("table", {"class": "p-datatable-table ng-star- inserted"}).findAll('tr')
print(table)

This yields the error message "AttributeError: 'NoneType' object has no attribute 'findAll'".

Using 'soup.findAll' also does not work.

If I try the XPath selector route using Selenium ...

DRIVER_PATH = '/Users/myname/Downloads/capitol-trades/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.implicitly_wait(1000)
driver.get('https://app.capitoltrades.com/politician/491')

table = driver.find_element_by_xpath("//*[@id='pr_id_2']/div/table").text
print(table)

Chrome continues to open up, but nothing gets printed in my Jupyter notebook (probably because there is no text directly inside the table element?).

I'd prefer to grab the table element using BeautifulSoup, but all answers are welcome. I appreciate any help you can provide.

CodePudding user response：

That site has a backend API that can be hit very easily:

import requests
import pandas as pd

url = 'https://api.capitoltrades.com/senators/trades/491/false?pageSize=20&pageNumber=1'
resp = requests.get(url).json()

df = pd.DataFrame(resp)
df.to_csv('naughty_nancy_trades.csv',index=False)
print('Saved to naughty_nancy_trades.csv ')

To see where the data comes from, open your browser's Developer Tools, go to Network > Fetch/XHR, and reload the page; you'll see the requests fire. I've scraped one of those network calls; there are others for the rest of the data on that page.
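As a rough sketch of how the paging parameters in that URL could be varied, the helper below rebuilds the same endpoint from its parts. The names `build_trades_url` and `fetch_trades` are made up for illustration, and it is only an assumption that the endpoint accepts other `pageNumber` values and keeps returning JSON in the same shape; check the Network tab to confirm.

```python
from urllib.parse import urlencode

# Base path copied from the request URL shown above.
API_BASE = "https://api.capitoltrades.com/senators/trades"


def build_trades_url(politician_id, page_size=20, page_number=1):
    """Rebuild the endpoint URL shown above from its parts."""
    query = urlencode({"pageSize": page_size, "pageNumber": page_number})
    # The "false" path segment is copied verbatim from the observed request;
    # what it controls is unknown here.
    return f"{API_BASE}/{politician_id}/false?{query}"


def fetch_trades(politician_id, page_number=1, timeout=10):
    """Fetch one page of JSON (assumes the endpoint is still reachable)."""
    import requests  # imported lazily so the URL helper works without it

    url = build_trades_url(politician_id, page_number=page_number)
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return resp.json()


if __name__ == "__main__":
    print(build_trades_url(491))
```

Calling `raise_for_status()` before `.json()` makes a blocked or moved endpoint show up as an HTTP error rather than a confusing JSON decode failure.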

CodePudding user response：

As per your Selenium code: you are missing a wait.
This

driver.find_element_by_xpath("//*[@id='pr_id_2']/div/table")

command returns the web element as soon as it is created, i.e. when it already exists but is not yet fully rendered.
This should work better:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

DRIVER_PATH = '/Users/myname/Downloads/capitol-trades/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
wait = WebDriverWait(driver, 20)

driver.get('https://app.capitoltrades.com/politician/491')

table = wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@id='pr_id_2']//tr[@class='p-selectable-row ng-star-inserted']"))).text
print(table)
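For intuition, an explicit wait such as `wait.until(...)` above keeps re-checking a condition until it returns something truthy or the timeout runs out. Below is a simplified, Selenium-free sketch of that polling idea. `wait_until` and `FakePage` are made-up names for illustration, and this is not Selenium's actual implementation.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


class FakePage:
    """Stand-in for a page whose rows only appear after a few checks."""

    def __init__(self, ready_after):
        self.checks = 0
        self.ready_after = ready_after

    def rows(self):
        self.checks += 1
        # Empty list (falsy) until the pretend rendering has finished.
        return ["row 1", "row 2"] if self.checks >= self.ready_after else []


if __name__ == "__main__":
    page = FakePage(ready_after=3)
    print(wait_until(page.rows, timeout=2, poll=0.01))
```

The point is that looking an element up once, right after `driver.get`, is a single check, whereas a wait repeats the check until the rows are actually there.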

As for your BS4 code, it looks like you are using the wrong locator.
This:

table = soup.find("table", {"class": "p-datatable-table ng-star-inserted"})

looks better (you have extra spaces in your class name).
The line above returns 5 elements.
So this is supposed to work:

table = soup.find("table", {"class": "p-datatable-table ng-star-inserted"}).findAll('tr')
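As a small sketch of a more forgiving lookup: a CSS selector passed to `select` matches on a single class, so it does not depend on the exact spacing or order of the full class string, and it returns an empty list instead of `None` when nothing matches. `SAMPLE_HTML` and `table_rows` are made up for illustration; the real page's markup may differ, and the table may not be present at all in the HTML that `requests` receives.

```python
from bs4 import BeautifulSoup

# Tiny made-up document that mimics the class names from the question.
SAMPLE_HTML = """
<table class="p-datatable-table ng-star-inserted">
  <tr><th>Publication Date</th></tr>
  <tr><td>29/Dec/2021</td></tr>
</table>
"""


def table_rows(html):
    """Return the cell texts of each row, or [] if no matching table exists."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on one class only, regardless of any other classes on the table.
    rows = soup.select("table.p-datatable-table tr")
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in rows
    ]


if __name__ == "__main__":
    print(table_rows(SAMPLE_HTML))
    # A page without the table gives an empty list, not an AttributeError.
    print(table_rows("<html><body>Loading...</body></html>"))
```

If the second case is what you get for the live page, the locator is not the problem; the table simply isn't in the downloaded HTML.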