I have to scrape the website https://portalbnmp.cnj.jus.br/#/pesquisa-peca.
- My goal is to select "Rio de Janeiro" in the field "Estado"
- Send the key "" to the field "Nome"
- Search
- In the table that appears, I have to click on each row.
- Click "Emitir" on the next page
- Return to the previous page and repeat the process for the next row of the table, and so on.
My code below runs without errors when I run it line by line, but in the loop I get all kinds of errors: stale, not clickable, not executable, etc. Any idea why this might happen?
for i in range(1, 11):
    element = driver.find_element_by_tag_name('p-dropdown')
    element.find_element_by_xpath("//*[contains(text(), 'Estado')]").click()
    element.find_element_by_xpath("//*[contains(text(), 'Rio de Janeiro')]").click()
    search = driver.find_element_by_name("nomePessoa")
    search.send_keys("")
    search.send_keys(Keys.RETURN)
    # row click
    table = driver.find_element_by_xpath("//div[@class='ui-datatable-tablewrapper ng-star-inserted']/table/tbody")
    rows = table.find_element_by_tag_name('tr')
    rows.find_element_by_xpath("//tr[" + str(i) + "]/td[1]").click()
    # click 'Emitir'
    buttons = driver.find_element_by_tag_name("button")
    buttons.find_element_by_xpath("//*[contains(text(), 'Emitir')]").click()
    # return page
    driver.back()
CodePudding user response:
When using Selenium, try adding checks to make sure the elements you are interacting with have loaded. In many cases you can do this with explicit waits. (Try not to use methods like sleep(); the documentation strongly advises against it.)
# import webdriver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# get element after explicitly waiting up to 10 seconds
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "p-dropdown"))
)  # I would consider looking up by ID or class
element.find_element_by_xpath("//*[contains(text(), 'Estado')]").click()
... etc
This ensures the element is present in the DOM before you interact with it. If it also has to be clickable rather than merely present, EC.element_to_be_clickable is the stricter condition. Another thing to keep in mind with Selenium is that an element must be visible before you can interact with it. You can scroll to an element to make sure it is visible:
# example that scrolls to bottom of page
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
# example that scrolls to a specific element
from selenium.webdriver.common.action_chains import ActionChains
actions = ActionChains(driver)
element = driver.find_element_by_tag_name('p-dropdown') # just an example
# queued actions only run once perform() is called
actions.move_to_element(element).perform()
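The stale-element errors in the loop usually mean a reference was located before the page re-rendered (for example after driver.back()) and then used afterwards. One way to handle that is to locate the element and act on it in a single step, and repeat that step if it fails. Below is a minimal sketch of such a helper. The name retry_on and its parameters are my own and not part of Selenium; it only assumes that the failure shows up as an exception.

```python
import time


def retry_on(exceptions, action, attempts=3, delay=0.5):
    """Call action() and retry it when it raises one of `exceptions`.

    `action` should locate the element AND act on it, so that every
    attempt works with a fresh reference instead of a stale one.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            last_error = error
            # Pause before the next attempt, but not after the final one.
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

With Selenium you would pass something like StaleElementReferenceException (from selenium.common.exceptions) as `exceptions`, and a function that finds row i and clicks it as `action`. This complements explicit waits rather than replacing them.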
CodePudding user response:
You can avoid Selenium altogether and speed this process up massively if you copy the cookie out of your browser and paste it into the code below. It searches Rio de Janeiro (idEstado = 19) and returns 100 results (you can change this), then loops through the results and saves the PDF files you want.
Note that the site you are scraping is volatile and often returns 500 responses, so I retry failed requests after waiting a few seconds:
import requests
import json
import re
import time
#NB get cookie header from Developer Tools - Network - fetch/xhr - Request Headers once you've passed the captcha test
cookie_value = 'portalbnmp=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJndWVzdF9wb3J0YWxibm1wIiwiYXV0aCI6IlJPTEVfQU5PTllNT1VTIiwiZXhwIjoxNjQzMzY1MjgzfQ.niaw12WlnO3okuY33medP7d3u6j1Y-xGPJ6mShgClfZPrs8br7HQm8XZ5k2k5Wz8J59epbUyE5KAGtSFPpEmrA'
headers = {
    'accept':'application/json, text/plain, */*',
    'accept-encoding':'gzip, deflate, br',
    'accept-language':'en-ZA,en;q=0.9',
    'origin':'https://portalbnmp.cnj.jus.br',
    'referer':'https://portalbnmp.cnj.jus.br/',
    'content-type':'application/json;charset=UTF-8',
    'cookie': cookie_value,
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
results = 100
url = f'https://portalbnmp.cnj.jus.br/bnmpportal/api/pesquisa-pecas/filter?page=0&size={str(results)}&sort=' #edited to get 100 results, you can edit this size variable
payload = {"buscaOrgaoRecursivo":False,"orgaoExpeditor":{},"idEstado":19} #19 = Rio de Janeiro
retries = 1
success = False
while not success:
    try:
        resp = requests.post(url,headers=headers,data=json.dumps(payload))
        print(resp)
        if resp.status_code == 200:
            success = True
            data = resp.json()
        else:
            # treat non-200 responses (e.g. 500) as failures so the wait below applies
            raise RuntimeError(f'HTTP {resp.status_code}')
    except Exception as e:
        print(url)
        wait = retries
        print(f'Error! Waiting {wait} secs and re-trying...')
        time.sleep(wait)
        retries += 1
print(len(data['content']))
ids = {str(x['id']):x['nomeMae'] + '-' + x['nomeOrgao'] for x in data['content']} #get all filenames and IDs
for id_,name in ids.items():
    url = f'https://portalbnmp.cnj.jus.br/bnmpportal/api/certidaos/relatorio/{id_}/10'
    retries = 1
    success = False
    while not success:
        try:
            pdf_data = requests.post(url,headers=headers)
            if pdf_data.status_code == 200:
                success = True
            else:
                # treat non-200 responses (e.g. 500) as failures so the wait below applies
                raise RuntimeError(f'HTTP {pdf_data.status_code}')
        except Exception as e:
            wait = retries
            print(f'Error! Waiting {wait} secs and re-trying...')
            time.sleep(wait)
            retries += 1
    filename = re.sub(r'[^\w\-_ ]', '_',name) + '.pdf' #remove bad characters for filename
    print(f'Saving {name}')
    with open(filename,'wb') as file:
        file.write(pdf_data.content)
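One thing to watch in the code above: the file name is built by concatenating nomeMae and nomeOrgao, which raises a TypeError if either value is missing, and two records with the same values would overwrite each other. Whether the API actually returns null for these fields is an assumption on my part, but a small helper makes the naming step more tolerant. The function name build_filename is hypothetical, and prefixing the id is my own choice to keep names unique.

```python
import re


def build_filename(record, extension=".pdf"):
    """Build a filesystem-safe file name from one search result."""
    # Missing or null fields become empty strings instead of raising TypeError.
    parts = [str(record.get(key) or "") for key in ("id", "nomeMae", "nomeOrgao")]
    # Join only the non-empty parts so no stray '-' separators appear.
    name = "-".join(part for part in parts if part)
    # Same substitution as above: keep word characters, '-', '_' and spaces.
    safe = re.sub(r"[^\w\-_ ]", "_", name)
    return safe + extension
```

You could then call build_filename(x) for each x in data['content'] instead of building the ids dictionary values by hand.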
