How to ignore some of the items in a list with Selenium

I'm trying to scrape something like this.

    <ul>
        <li> apple <b>price:2.8</b> </li>
        <li> orange </li>
        <li> banana <b>price:4.3</b> </li>
        <li> peach <b>price:2.3</b> </li>
    </ul>

Some of the items don't have a price, and I don't know in advance which ones. I need to get the name and the price of each item. If an item doesn't have a price, it should be ignored.

Here is my code:

name_list = driver.find_elements(By.TAG_NAME, "li")
price_list = driver.find_elements(By.TAG_NAME, "b")

for n in name_list:
    name = name_list[n]
    price = price_list[n]

The error message is "IndexError: list index out of range" because the name and price lists have different lengths.

Is there some way to fix this?

CodePudding user response：

Don't create two separate lists. Instead, get all <li> elements, iterate over the results, and call find_elements(...) on each element to query its children.

Something like this could work:

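from selenium.webdriver.common.by import By

name_list = driver.find_elements(By.TAG_NAME, "li")

for n in name_list:
    # Search only inside this <li>; the result is an empty list if there is no <b>
    children = n.find_elements(By.TAG_NAME, "b")
    if children:
        price = children[0].text  # this item has a price, so process it here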

Check the Selenium docs for details.

(Also, you are using for ... in incorrectly: it iterates over the elements themselves, not over indices.)
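
To make that concrete, here is a minimal sketch of the approach that also skips items without a price, as the question asks. The function name collect_priced_items and the price_locator parameter are my own naming, not part of Selenium. The sketch only assumes that each item has a .text attribute and a find_elements(by, value) method, as Selenium's WebElement does, and that the price is the last piece of text in each item, as in the sample HTML.

```python
def collect_priced_items(list_items, price_locator):
    """Map item name -> price text, skipping items that have no price.

    list_items: the <li> elements, e.g. from driver.find_elements(By.TAG_NAME, "li")
    price_locator: a (by, value) pair, e.g. (By.TAG_NAME, "b")
    """
    result = {}
    for item in list_items:
        # find_elements returns an empty list when nothing matches,
        # so no exception handling is needed here
        price_elements = item.find_elements(*price_locator)
        if not price_elements:
            continue  # no price: ignore this item

        price_text = price_elements[0].text
        full_text = item.text  # assumed to end with the price text
        name = full_text[: len(full_text) - len(price_text)].strip()
        result[name] = price_text
    return result
```

With a live driver, a call might look like collect_priced_items(driver.find_elements(By.TAG_NAME, "li"), (By.TAG_NAME, "b")).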

CodePudding user response：

Here is a complete working example that shows how to get the price data:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("file:///Users/brian/pysel/test_data.html")

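name_list = driver.find_elements(By.TAG_NAME, "li")

namePriceData = {}

# enumerate yields (index, element) pairs, so name_list[i] is the same object as ele
for i, ele in enumerate(name_list):
    name = name_list[i]

    try:
        # Search only inside this <li> for its <b> child
        price = ele.find_element(By.TAG_NAME, "b")
    except NoSuchElementException:
        # No price for this item, so store an empty string
        namePriceData[name.text] = ""
        continue

    # Slice the price text off the end of the <li> text, then trim the leftover space
    namePriceData[name.text[:len(name.text) - len(price.text)].strip()] = price.text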

print(namePriceData)

driver.close()

We accomplish this by calling find_element on each of the li elements in name_list. This allows us to get a child of an element.

This test data is a little awkward because the name is not wrapped in its own tag: we have to take the text of the parent li element and slice the child's text off the end using the string lengths. If your real data has a different node structure, the strategy for locating these elements on the page will change.
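
As a worked example of that slicing step, here is a small sketch on plain strings. The helper name split_name is hypothetical, and the sample strings are an assumption about what the li element and its b child would report as .text for the first item in the test data.

```python
def split_name(li_text: str, price_text: str) -> str:
    """Return the part of li_text that precedes the trailing price text."""
    # Drop as many characters from the end as the price text is long
    name = li_text[: len(li_text) - len(price_text)]
    # The space that separated name and price is left over, so trim it
    return name.strip()


li_text = "apple price:2.8"   # assumed text of the whole <li>
price_text = "price:2.8"      # assumed text of its <b> child
print(split_name(li_text, price_text))  # prints "apple"
```

Note that this only works when the price is the last text inside the item. If the price appeared before the name, the slice would cut the wrong end.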

Also, please note:

The original code raised an error because it uses n, which is a WebElement rather than an integer, as the index into the list of elements.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/ct/tm9p5wz92dz4l60w9l503f780000gn/T/ipykernel_27037/104717083.py in <module>
      6 
      7 for n in name_list:
----> 8     name = name_list[n]
      9     price = price_list[n]

TypeError: list indices must be integers or slices, not WebElement

If you want to access elements by position, you need to iterate using an index instead of the actual value. In Python, you can do this with enumerate.
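
As a small illustration, with plain strings standing in for the WebElement objects (an assumption made so the snippet runs without a browser), the difference between the two loop styles looks like this:

```python
items = ["apple", "orange", "banana"]

# A plain for loop yields the values themselves, so using one as an index fails
message = None
try:
    for n in items:
        items[n]
except TypeError as exc:
    message = str(exc)

print(message)  # list indices must be integers or slices, not str

# enumerate yields (index, value) pairs, so the index is safe to use
pairs = [(i, items[i]) for i, _ in enumerate(items)]
print(pairs)  # [(0, 'apple'), (1, 'orange'), (2, 'banana')]
```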

Using one index for both lists is also wrong, and not only because of the index out of range error. The two lists are misaligned: once we reach an item with no price, it is assigned the price of the next item that does have one, and every assignment from that point on is incorrect until we hit the index out of range error.
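
To see the misalignment without a browser, here is a sketch that uses plain strings in place of the elements. The two lists are an assumption about what the li and b lookups would yield for the sample HTML, simplified to bare names and prices.

```python
names = ["apple", "orange", "banana", "peach"]    # one entry per <li>
prices = ["price:2.8", "price:4.3", "price:2.3"]  # one entry per <b>; orange has none

paired = {}
error = None
try:
    for i in range(len(names)):
        paired[names[i]] = prices[i]  # fails once i reaches len(prices)
except IndexError as exc:
    error = exc

print(paired)
# {'apple': 'price:2.8', 'orange': 'price:4.3', 'banana': 'price:2.3'}
print(type(error).__name__)  # IndexError
```

Orange ends up with banana's price, banana with peach's, and peach triggers the IndexError, so every pairing after the first missing price is wrong.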

Instead, it would be better to iterate using the actual element and not an index:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("file:///Users/brian/pysel/test_data.html")

name_list = driver.find_elements(By.TAG_NAME, "li")

namePriceData = {}

for ele in name_list:
    try:
        # Search only inside this <li> for its <b> child
        price = ele.find_element(By.TAG_NAME, "b")
    except NoSuchElementException:
        # No price for this item, so store an empty string
        namePriceData[ele.text] = ""
        continue

    # Slice the price text off the end of the <li> text, then trim the leftover space
    namePriceData[ele.text[:len(ele.text) - len(price.text)].strip()] = price.text

print(namePriceData)

driver.close()