I'm trying to get data from a search results page, but every time I give Beautiful Soup a specific link I get errors. I think it's because the page isn't the same every time you visit it. I'm not sure what this is called, so I don't even know what to search for; any help would be appreciated.
This is the link to the search results, but unless you've already made a search, it won't show any results when you visit it: https://www.clarkcountycourts.us/Portal/Home/WorkspaceMode?p=0
Instead, if you copy and paste it, it takes you to this page to make a search: https://www.clarkcountycourts.us/Portal/ and then you have to click Smart Search.
For simplicity's sake, let's say we search for "Robinson" and I need to take the table data and export it to an Excel file. I can't give Beautiful Soup a link because, I believe, the link isn't valid on its own. How would I go about this?
Even pulling the tables up with a simple table view doesn't show any of the data from our "Robinson" search, such as Case Number or File Date, to build a pandas DataFrame from.
EDIT: Thanks to @Arundeep Chohan, this is what I've got so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(20) # gives an implicit wait for 20 seconds
driver.get("https://www.clarkcountycourts.us/Portal/Home/Dashboard/29")
search_box = driver.find_element(By.ID, "caseCriteria_SearchCriteria") # find_element_by_id was removed in Selenium 4.3
search_box.send_keys("Robinson")
#Code to complete captchas
WebDriverWait(driver, 15).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[name^='a-'][src^='https://www.google.com/recaptcha/api2/anchor?']")))
WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//span[@id='recaptcha-anchor']"))).click()
driver.switch_to.default_content() #necessary to switch out of iframe element for submit button
time.sleep(5) #gives time to click submit to results
driver.find_element(By.ID, "btnSSSubmit").click() # click() returns None, so there is nothing to keep
time.sleep(5)
soup = BeautifulSoup(driver.page_source)
tbl = soup.find_all("table")
dfs = pd.read_html(str(tbl)) # returns a list of DataFrames, one per table
print(dfs)
It managed to open Chrome and get to the tables of data. But now I'm getting this message from Beautiful Soup:
c:\Users\phlfo\Web_Scraper\1scraper.py:33: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. The code that caused this warning is on line 33 of the file c:\Users\phlfo\Web_Scraper\1scraper.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
CodePudding user response:
options = Options()
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=options)
driver.maximize_window()
wait=WebDriverWait(driver,10)
driver.get('https://www.clarkcountycourts.us/Portal/')
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"a.portlet-buttons"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"input#caseCriteria_SearchCriteria"))).send_keys("Robinson")
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[@title='reCAPTCHA']")))
elem=wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"div.recaptcha-checkbox-checkmark")))
driver.execute_script("arguments[0].click()", elem)
driver.switch_to.default_content()
input("Waiting for recaptcha done") # pause here; solve the reCAPTCHA by hand in the browser, then press Enter
wait.until(EC.element_to_be_clickable((By.XPATH,"(//input[@id='btnSSSubmit'])[1]"))).click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(str(soup))[0]
print(df)
This should be the minimum needed to get to your page. There's an iframe to deal with and a spinner to deal with. After that, just use pandas to grab the table.
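The code above reads driver.page_source straight after the click, so the spinner may still be on screen at that point. One possible way to wait for it, sketched under assumptions: WebDriverWait.until accepts any callable that takes the driver, so a small condition function can poll for result rows. The function name and the "table tbody tr" selector are illustrative guesses and are not taken from the page itself.

```python
def results_table_ready(driver):
    """Wait condition: return the result rows once any exist, else False.

    Hypothetical helper; the CSS selector is an assumption about the page.
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    rows = driver.find_elements("css selector", "table tbody tr")
    return rows if rows else False


# Usage with the `driver` from the answer (needs selenium and a live browser):
#   WebDriverWait(driver, 20).until(results_table_ready)
#   soup = BeautifulSoup(driver.page_source, "html.parser")
```

If the page always renders an empty table first, a condition on the spinner element disappearing would be the safer choice; its selector would have to be looked up in the browser's inspector.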
(edit): They have since added a proper reCAPTCHA, so add a solver where I put my pause input.
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
from bs4 import BeautifulSoup
Output:
Waiting for manual date to be entered. Enter YES when done.
Unnamed: 0_level_0 ... Date of Birth
Case Number ... File Date
Case Number ... File Date
0 NaN ... NaN
1 NaN ... Cases (1) Case NumberStyle / DefendantFile Da...
2 Case Number ... File Date
3 08A575873 ... 11/17/2008
4 NaN ... NaN
5 NaN ... Cases (1) Case NumberStyle / DefendantFile Da...
6 Case Number ... File Date
7 08A575874 ... 11/17/2008
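The printed frame still mixes real case rows with repeated header rows and NaN filler rows. A minimal sketch of one way to keep only the case rows and save them for Excel follows. It assumes the column names have been flattened to a single level and that case numbers look like 08A575873 (two digits, a letter, then digits). The sample frame is a hand-built stand-in using only the two columns and values visible in the output above, and keep_case_rows is a hypothetical helper name.

```python
import pandas as pd

# Hand-built stand-in for two columns of the scraped frame; values are copied
# from the output above, and single-level column names are an assumption.
raw = pd.DataFrame(
    {
        "Case Number": [None, None, "Case Number", "08A575873",
                        None, None, "Case Number", "08A575874"],
        "File Date": [None, "Cases (1) Case NumberStyle / DefendantFile Da...",
                      "File Date", "11/17/2008",
                      None, "Cases (1) Case NumberStyle / DefendantFile Da...",
                      "File Date", "11/17/2008"],
    }
)


def keep_case_rows(df, column="Case Number", pattern=r"\d{2}[A-Z]\d+"):
    """Keep rows whose `column` is entirely a case number (assumed shape)."""
    is_case = df[column].str.fullmatch(pattern, na=False)
    return df[is_case].reset_index(drop=True)


clean = keep_case_rows(raw)

# With no path, to_csv returns the CSV text; pass a filename to write a file
# that Excel can open. clean.to_excel("cases.xlsx", index=False) would write a
# real workbook but needs an engine such as openpyxl installed.
csv_text = clean.to_csv(index=False)
print(csv_text)
```

Filtering on the shape of the case number drops header and filler rows in one pass, but the pattern would need adjusting if other case-number formats show up in the results.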
