Webscraping - Python - "NoneType object has no attribute 'text'"

I'm scraping a product page with the following script:

from requests_html import HTMLSession
import re



s = HTMLSession()

link = "https://www.kaufland.de/product/358005366/"
def get_products(link):
    r = s.get(link)
    title = r.html.find('h1', first=True).text
    price = r.html.find('div.rd-buybox__price', first=True).text.replace(' €', '').replace(',', '.')
    descriptiontable = r.html.find('div.rd-attribute-table', first=True).text
    print(title, price, descriptiontable)
get_products(link)

The area I'm trying to scrape (containing the manufacturer, EAN, et cetera) doesn't seem to be scrapable, unlike the price and title. What am I doing wrong?

CodePudding user response：

It looks like the product details table you're after is populated by JavaScript after the page loads, so it's not in the HTML retrieved by r = s.get(link). As explained in rayt's answer, this is why you get None returned.

However, the data that the table contains is on the page, inside a <script> tag near the bottom:

<script> window.__NUXT__ = (function(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, _, $, aa, ab, ac, ad, ae, af, ag, ah, ai, aj, ak, al, am, an, ao, ap, aq, ar, as, at, au, av, aw, ax, ay, az, aA, aB, aC, aD, aE, aF, aG, aH, aI, aJ, aK, aL, aM, aN, aO, aP, aQ, aR, aS, aT, aU, aV, aW, aX, aY, aZ, a_, a$, ba, bb, bc, bd, be, bf, bg, bh, bi, bj, bk, bl, bm, bn, bo, bp, bq, br, bs, bt, bu, bv, bw, bx, by, bz, bA, bB, bC, bD, bE, bF, bG, bH, bI, bJ, bK, bL, bM, bN, bO, bP, bQ, bR, bS, bT, bU, bV, bW, bX, bY, bZ, b_, b$, ca, cb, cc, cd, ce, cf, cg, ch, ci, cj, ck, cl, cm, cn, co, cp, cq, cr, cs, ct, cu, cv, cw, cx, cy, cz, cA, cB, cC, cD, cE, cF, cG, cH, cI, cJ, cK, cL, cM, cN, cO, cP, cQ, cR, cS, cT, cU, cV, cW, cX, cY, cZ, c_, c$, da, db, dc, dd, de, df, dg, dh, di, dj, dk, dl, dm, dn, do0, dp, dq, dr, ds, dt, du, dv, dw, dx, dy, dz, dA, dB, dC, dD, dE, dF, dG, dH, dI, dJ, dK, dL, dM, dN, dO, dP, dQ, dR, dS, dT, dU, dV, dW, dX, dY, dZ, d_, d$, ea, eb, ec, ed, ee, ef, eg, eh, ei, ej, ek, el, em, en, eo, ep, eq, er, es, et, eu, ev, ew, ex, ey, ez, eA, eB, eC, eD, eE, eF, eG, eH, eI, eJ, eK, eL, eM, eN, eO, eP, eQ, eR, eS, eT, eU, eV, eW, eX, eY, eZ, e_, e$, fa, fb, fc, fd, fe, ff, fg, fh, fi, fj, fk, fl, fm, fn, fo, fp, fq, fr, fs, ft, fu, fv, fw, fx, fy, fz, fA, fB, fC, fD, fE, fF, fG, fH, fI, fJ, fK, fL, fM, fN, fO, fP, fQ, fR, fS, fT, fU, fV, fW, fX, fY, fZ, f_, f$, ga, gb, gc, gd, ge, gf, gg, gh, gi, gj, gk, gl, gm, gn, go, gp, gq, gr, gs, gt, gu, gv, gw, gx, gy, gz, gA, gB, gC, gD, gE, gF, gG, gH, gI, gJ, gK, gL, gM, gN, gO, gP, gQ, gR, gS, gT, gU, gV, gW, gX, gY, gZ, g_, g$, ha, hb, hc, hd, he, hf, hg, hh, hi, hj, hk, hl, hm) {
    return {
        layout: cG,
        data: [{}],
        fetch: {},

            ...

                },
                description$: {
                    descriptionHtml: "\u003Cp\u003E\u003Cb\u003EIm System sind folgende komponenten verbaut:\u003C\u002Fb\u003E\u003C\u002Fp\u003E\u003Cul\u003E\u003Cli\u003E\u003Cb\u003EGehäuse:\u003C\u002Fb\u003E Systemtreff Mini Tower Nero ST-401\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EProzessor: \u003C\u002Fb\u003EIntel Core i5-10400F 6 x 2.9 GHz (bei Bedarf bis zu 4.3 GHz Turbotakt durch Intel Turbo-Boost Technik)\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EArbeitsspeicher:\u003C\u002Fb\u003E 16 GB DDR4 2666 MHz \u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EMainboard:\u003C\u002Fb\u003E Gigabyte H510M H, Intel Sockel 1200 (1 x PCIe 4.0\u002F3.0 x16 (x16 mode), 1 x PCIe 3.0 x1, 1 x PS\u002F2 keyboard \u002F PS\u002F2 mouse, 1 x VGA 1 x HDMI,  1 x LAN (RJ45), 2 x USB 3.2, 4 x USB 2.0, 1 x M.2 (Key M), 4xSATA) - max. 64 GB DDR4 - 3200 MHz\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ENetzwerk:\u003C\u002Fb\u003E 1 x Gigabit LAN Controller(s)\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ESound:\u003C\u002Fb\u003E Realtek® ALC887 8-Channel High Definition Audio CODEC\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EFestplatte:\u003C\u002Fb\u003E 512GB M.2 SSD SATA III\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EGrafik:\u003C\u002Fb\u003E NVIDIA GeForce GT 730 mit 2048 MB \u002F 2GB RAM \u003Cul\u003E\u003Cli\u003ETechnik: ( GDDR3 \u002F DirectX 11 \u002F PCI Express 2.0 \u002F ) \u003C\u002Fli\u003E \u003Cli\u003EGeeignet für Heimvideos - Blu-ray FULL HD - Videobearbeitung \u002F World of Warcraft, Spore oder Sims3, sowie die Anschlussmöglichkeiten von bis zu 2 Monitore\u003C\u002Fli\u003E\u003C\u002Ful\u003E\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ENetzteil:\u003C\u002Fb\u003E 400-500Watt Marken Netzteil\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ELaufwerk:\u003C\u002Fb\u003E Kein Laufwerk 
verbaut\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EBetriebssystem:\u003C\u002Fb\u003E Windows 10 Pro\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ESKU:\u003C\u002Fb\u003E 20192420\u003C\u002Fli\u003E\u003Cli\u003EMarkennamen -  Markenlogos sind registrierte Handelsmarken, deren Nutzung hier nur zur Produktbeschreibung eingesetzt werden - das Eigentumsrecht liegt beim jeweiligen Markeninhaber.\u003C\u002Fli\u003E\u003C\u002Ful\u003E",
                    attributes: {
                        default: [{
                            name: "Hersteller",
                            id: "manufacturer",
                            values: [{
                                text: "SYSTEMTREFF",
                                link: "\u002Fmanufacturer\u002F1428338\u002F",
                                isMasked: a
                            }],
                            isCategoryRelevant: d,
                            isDefaultRelevant: d
                        }, {
                            name: "Betriebssystem",
                            id: "operating_system",
                            values: [{
                                text: "Windows 10 Pro",
                                link: "\u002Fcategory\u002F39251\u002Fref-381=1388287\u002F",
                                isMasked: a
                            }],
                            isCategoryRelevant: d,
                            isDefaultRelevant: a
                        }, {
                            name: cJ,
                            id: cK,
                            values: [{
                                text: cL,
                                link: cM,
                                isMasked: a
                            }],
                            isCategoryRelevant: a,
                            isDefaultRelevant: a
                        }, {

I hope you'll forgive my use of BeautifulSoup in this example; I'm more familiar with it than requests_html. Here's how you might fetch the <script> tag content:

import requests
from bs4 import BeautifulSoup

def get_products(link):
    r = requests.get(link)
    html = r.text
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1').text.strip()
    price = soup.find('div', {'class':'rd-buybox__price'}).text.strip().replace(' €', '').replace(',', '.')
    descriptiontable = extract_description(soup)
    print(title, price, descriptiontable)

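def extract_description(soup):
    # Assumes the window.__NUXT__ payload sits in the 3rd <script> tag; this index may change
    product_data = soup.find_all('script')[2]
    # Keep only what follows 'return {', i.e. the body of the returned object
    product_data = str(product_data).partition('return {')[-1]
    # Cut off the function's argument list and restore the outer braces
    product_data = '{' + product_data.split('}(')[0] + '}'
    # You'll need to parse this content here to find the bits you need
    return product_data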


if __name__ == '__main__':
    link = "https://www.kaufland.de/product/358005366/"
    get_products(link)
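
As a rough sketch of the parsing step left open above, a regular expression can pull out the attribute entries whose name and value are written as literal strings in the script text. The names ATTRIBUTE_RE and parse_attributes are made up for this illustration, and the sample string is a shortened copy of the excerpt shown earlier. Entries that refer to minified variables (such as name: cJ) are skipped, because their values are only defined in the function's argument list.

```python
import re

# Matches one attribute entry whose name, id and first value text are literal
# strings, e.g.  name: "Hersteller", id: "manufacturer", values: [{ text: "SYSTEMTREFF"
ATTRIBUTE_RE = re.compile(
    r'name:\s*"(?P<name>[^"]*)",\s*'
    r'id:\s*"(?P<id>[^"]*)",\s*'
    r'values:\s*\[\{\s*text:\s*"(?P<text>[^"]*)"'
)


def parse_attributes(script_text):
    """Return {attribute name: first value text} for literal entries only."""
    return {
        match.group("name"): match.group("text")
        for match in ATTRIBUTE_RE.finditer(script_text)
    }


# Shortened sample modelled on the <script> excerpt from the page
sample = r'''
attributes: {
    default: [{
        name: "Hersteller",
        id: "manufacturer",
        values: [{
            text: "SYSTEMTREFF",
            link: "\u002Fmanufacturer\u002F1428338\u002F",
            isMasked: a
        }],
        isCategoryRelevant: d
    }, {
        name: "Betriebssystem",
        id: "operating_system",
        values: [{
            text: "Windows 10 Pro",
            isMasked: a
        }],
        isCategoryRelevant: d
    }, {
        name: cJ,
        id: cK,
        values: [{ text: cL, link: cM, isMasked: a }]
    }]
}
'''

attributes = parse_attributes(sample)
print(attributes)
```

In extract_description you could pass the script text to parse_attributes instead of returning the raw string. The pattern depends on the exact key order seen in the excerpt, so it will likely need adjusting if the page layout changes.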

CodePudding user response：

It seems like the .find method returns None if no match was found. You should test for None first. You could do something like this:

...
def get_products(link):
    ...
    title_tag = r.html.find("h1", first=True)
    if title_tag is None:
        return
    title = title_tag.text
    ...
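
If you need the same check for several fields, one option is a small helper that returns a default instead of raising. This is only a sketch: safe_text is a hypothetical name, and the SimpleNamespace objects stand in for whatever r.html.find(..., first=True) returns.

```python
from types import SimpleNamespace


def safe_text(element, default=None):
    """Return element.text, or default when the lookup returned None."""
    return element.text if element is not None else default


# Stand-ins for the results of r.html.find(..., first=True)
found = SimpleNamespace(text="Gaming PC")  # selector matched
missing = None                             # selector matched nothing

title = safe_text(found)
descriptiontable = safe_text(missing, default="")
print(repr(title), repr(descriptiontable))
```

This keeps get_products running when one selector fails, at the cost of hiding the failure, so it may still be worth logging which field came back empty.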

If you expect .find to actually find something and it doesn't, you should check your logic and selectors. Also, try saving the raw html and reading the output. Web pages sometimes return something different when you are attempting to scrape them.

You can save the output to a file in Python like this:

with open("output.html", "w", encoding="utf-8") as fid:
    fid.write(r.text)

I'm assuming the .text attribute exists on the Response object; the Requests-HTML documentation does not seem very complete.
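
Once you have the raw HTML, a quick way to see which selectors can never match is to look for their class names in the text. The sketch below assumes nothing about the real page: missing_markers is a hypothetical helper and raw_html is a made-up stand-in for the contents of output.html.

```python
def missing_markers(html, markers):
    """Return the markers (e.g. class names) that do not occur in the raw HTML."""
    return [marker for marker in markers if marker not in html]


# Made-up stand-in for the contents of output.html
raw_html = '<h1>Title</h1><div class="rd-buybox__price">499,00 €</div>'

absent = missing_markers(raw_html, ["rd-buybox__price", "rd-attribute-table"])
print(absent)
```

A class name that shows up as absent here is probably added by JavaScript after the page loads, which would explain why .find returns None for it.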