Home > database >  Weird behaviour when Goodreads scraping (Python)
Weird behaviour when Goodreads scraping (Python)

Time:01-26

I'm trying to scrape Goodreads and more specifically Goodreads editions by giving some ISBNs as input. However, I get an error and not even at the same step every time of the code running process:

Traceback (most recent call last):
  File "C:xxx.py", line 47, in <module>
    ed_details = get_editions_details(isbn)
  File "C:xxx.py", line 30, in get_editions_details
    ed_item = soup.find("div", class_="otherEditionsLink").find("a")
AttributeError: 'NoneType' object has no attribute 'find'

Everything should be correct, the div class is the correct one and it seems like is there for all books. I checked with every browser and the page looks the same to me. I don't know if it's because of a deprecated library or something at this point.

import requests
from bs4 import BeautifulSoup as bs


def get_isbn():
    isbns = ['9780544176560', '9781796898279', '9788845278518', '9780374165277', '9781408839973', '9788838919916', '9780349121994', '9781933372006', '9781501167638', '9781427299062', '9788842050285', '9788807018985', '9780340491263', '9789463008594', '9780739349083', '9780156011594', '9780374106140', '9788845251436', '9781609455910']
    return isbns


def get_page(base_url, data):
    try:
        r = requests.get(base_url, params=data)
    except Exception as e:
        r = None
        print(f"Server responded: {e}")
    return r


def get_editions_details(isbn):
    # Create the search URL with the ISBN of the book
    data = {'q': isbn}
    book_url = get_page("https://www.goodreads.com/search", data)
    # Parse the markup with Beautiful Soup
    soup = bs(book_url.text, 'lxml')

    # Retrieve from the book's page the link for other editions
    # and the total number of editions

    ed_item = soup.find("div", class_="otherEditionsLink").find("a")

    ed_link = f"https://www.goodreads.com{ed_item['href']}"
    ed_num = ed_item.text.strip().split(' ')[-1].strip('()')

    # Return a tuple with all the informations
    return ((ed_link, int(ed_num), isbn))


if __name__ == "__main__":
    # Get the ISBNs from the user
    isbns = get_isbn()

    # Check all the ISBNs
    for isbn in isbns:
        ed_details = get_editions_details(isbn)

CodePudding user response:

You should always check return values.

book_url = get_page("https://www.goodreads.com/search", data)
soup = bs(book_url.text, 'lxml')
ed_item = soup.find("div", class_="otherEditionsLink").find("a")

In these statements, if any of the returned values are None, you will get an error when you try to call a member function. So if soup is None, for example, you would be doing something like None.find(....), which is obviously wrong.

In the last line, for example, you can fix this by breaking it up into 2 parts:

if ed_item := soup.find("div", class_="otherEditionsLink"):
    if ed_item := ed_item.find("a"):
        ....other code here....

As long as soup is valid, this code won't try to call a function on a None value.

  •  Tags:  
  • Related