Problem with the code when running a loop with Selenium in Python

I need to scrape the website https://portalbnmp.cnj.jus.br/#/pesquisa-peca.

My goal is to select "Rio de Janeiro" in the field "Estado".
Send the key "" to the field "Nome".
Search.
In the table that appears, I have to click each row.
Click "Emitir" on the next page.
Return to the previous page and repeat the process for the next row of the table, and so on.

My code below runs without error when I run it line by line, but in the loop I get all kinds of errors: stale element, element not clickable, not executable, etc. Any idea why this might happen?

for i in range(1, 11):
   
    element = driver.find_element_by_tag_name('p-dropdown')
    element.find_element_by_xpath("//*[contains(text(), 'Estado')]").click()
    element.find_element_by_xpath("//*[contains(text(), 'Rio de Janeiro')]").click()
        
    search = driver.find_element_by_name("nomePessoa")
    search.send_keys("")
    
    search.send_keys(Keys.RETURN)
         
    # row click 
    table = driver.find_element_by_xpath("//div[@class='ui-datatable-tablewrapper ng-star-inserted']/table/tbody")
    rows = table.find_element_by_tag_name('tr')
    
    rows.find_element_by_xpath("//tr[" + str(i) + "]/td[1]").click()
    
    # click 'Emitir'
    buttons = driver.find_element_by_tag_name("button")
    buttons.find_element_by_xpath("//*[contains(text(), 'Emitir')]").click()
    
    # return page
    driver.back()

CodePudding user response：

When using Selenium, add checks to make sure the elements you interact with have loaded before you use them. Explicit waits do this by polling for a condition up to a timeout. (Avoid fixed delays such as sleep(); the documentation strongly advises against them.)

# import webdriver 
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

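# get the element after explicitly waiting up to 10 seconds
element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "p-dropdown"))
    )  # I would consider looking up by ID or class
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium releases no longer provide
element.find_element(By.XPATH, "//*[contains(text(), 'Estado')]").click()
# ... and so on for the remaining steps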

This way you do not click an element before it is present on the page. Another thing to keep in mind with Selenium is that an element must be visible before you can interact with it. You can scroll to an element to bring it into view:

# example that scrolls to bottom of page
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
# example that scrolls to a specific element
from selenium.webdriver.common.action_chains import ActionChains
actions = ActionChains(driver)
element = driver.find_element(By.TAG_NAME, 'p-dropdown')  # just an example; By is imported above
actions.move_to_element(element).perform()  # queued actions only run when perform() is called
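
The "stale" errors in the question most likely come from reusing element references after driver.back(): once the page is rebuilt, references obtained earlier no longer point at live elements. A minimal sketch of one way around this is to look the element up again on every attempt. The helper name click_fresh and its parameters are hypothetical, not part of Selenium, and the short pause is only a polling interval between attempts, not a fixed wait.

```python
import time


def click_fresh(locate, attempts=3, pause=0.5, retry_on=(Exception,)):
    """Re-locate an element and click it, retrying if the click fails.

    `locate` is a zero-argument callable that performs a fresh lookup each
    time, so a reference invalidated by a page reload is never reused.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            element = locate()      # fresh lookup on every attempt
            element.click()
            return attempt          # how many attempts were needed
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(pause)   # brief polling interval before retrying
    raise last_error
```

With Selenium, locate could be something like lambda: driver.find_element(By.XPATH, f"//tr[{i}]/td[1]"), and retry_on could be narrowed to (StaleElementReferenceException, ElementClickInterceptedException) from selenium.common.exceptions so that unrelated errors still surface immediately.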

CodePudding user response：

You can avoid Selenium altogether and speed this up massively by calling the site's API directly. Copy the cookie out of your browser and paste it into the code below, which searches for Rio de Janeiro (idEstado = 19), returns 100 results (you can change this), then loops through the results and saves the PDF files you want.

Note that the site you are scraping is unstable and often returns 500 responses, so I retry requests after waiting a few seconds:

import requests
import json
import re
import time

# NB: get the cookie header from Developer Tools > Network > Fetch/XHR > Request Headers once you've passed the captcha test
cookie_value = 'portalbnmp=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJndWVzdF9wb3J0YWxibm1wIiwiYXV0aCI6IlJPTEVfQU5PTllNT1VTIiwiZXhwIjoxNjQzMzY1MjgzfQ.niaw12WlnO3okuY33medP7d3u6j1Y-xGPJ6mShgClfZPrs8br7HQm8XZ5k2k5Wz8J59epbUyE5KAGtSFPpEmrA'

headers =   {
    'accept':'application/json, text/plain, */*',
    'accept-encoding':'gzip, deflate, br',
    'accept-language':'en-ZA,en;q=0.9',
    'origin':'https://portalbnmp.cnj.jus.br',
    'referer':'https://portalbnmp.cnj.jus.br/',
    'content-type':'application/json;charset=UTF-8',
    'cookie': cookie_value,
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }

results = 100
url = f'https://portalbnmp.cnj.jus.br/bnmpportal/api/pesquisa-pecas/filter?page=0&size={results}&sort=' # size is set from `results` (100 here); change it to fetch more or fewer

payload = {"buscaOrgaoRecursivo":False,"orgaoExpeditor":{},"idEstado":19} #19 = Rio de Janeiro

retries = 1
success = False
while not success:
    try:
        resp = requests.post(url,headers=headers,data=json.dumps(payload))
        print(resp)
        resp.raise_for_status()  # treat error responses (e.g. 500) as failures so they are retried after a wait
        data = resp.json()
        success = True
    except Exception as e:
        print(url)
        wait = retries
        print(f'Error! Waiting {wait} secs and re-trying...')
        time.sleep(wait)
        retries += 1  # wait one second longer on each retry

print(len(data['content']))

ids = {str(x['id']): x['nomeMae'] + '-' + x['nomeOrgao'] for x in data['content']} #get all filenames and IDs

for id_,name in ids.items():
    url = f'https://portalbnmp.cnj.jus.br/bnmpportal/api/certidaos/relatorio/{id_}/10'

    retries = 1
    success = False
    while not success:
        try:
            pdf_data = requests.post(url,headers=headers)
            pdf_data.raise_for_status()  # retry error responses after a wait instead of looping immediately
            success = True
        except Exception as e:
            wait = retries
            print(f'Error! Waiting {wait} secs and re-trying...')
            time.sleep(wait)
            retries += 1  # wait one second longer on each retry

    filename = re.sub(r'[^\w\-_ ]', '_',name) + '.pdf' #replace characters that are unsafe in filenames with '_'
    print(f'Saving {name}')
    with open(filename,'wb') as file:
        file.write(pdf_data.content)
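
Both retry loops above keep trying indefinitely, which can hang the script if the cookie has expired or the site stays down. As a rough sketch, the same wait-and-retry idea could be wrapped in one helper that gives up after a fixed number of attempts. The names call_with_retries, max_attempts and base_wait are hypothetical and not part of requests; request is assumed to be any zero-argument callable returning an object with a status_code attribute.

```python
import time


def call_with_retries(request, max_attempts=5, base_wait=1.0, sleep=time.sleep):
    """Call `request()` until it returns a response with status code 200.

    Waits base_wait * attempt seconds between attempts and re-raises the
    last error once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = request()
            if response.status_code == 200:
                return response
            error = RuntimeError(f"unexpected status {response.status_code}")
        except Exception as exc:  # broad on purpose, mirroring the loops above
            error = exc
        if attempt < max_attempts:
            sleep(base_wait * attempt)  # wait a little longer each time
    raise error
```

In the script above this might be used as resp = call_with_retries(lambda: requests.post(url, headers=headers, data=json.dumps(payload))), and again for each PDF download.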