Difficulty Scraping Product Information from Website

Time:02-08

I am having difficulty scraping the "product name" and "price" from this website: https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571

I am looking to scrape "$4.30" and "Zespri New Zealand Kiwifruit - Green" from the webpage. I have tried various approaches (Beautiful Soup, requests_html, Selenium) without any success. The sample code for each approach is included below.

I am able to view the 'price' and 'product name' details in the "Developer Tools" tab of Chrome. It seems that the webpage uses JavaScript to load the product information dynamically, so the approaches mentioned above are not able to scrape it properly.

I would appreciate any assistance on this issue.

Requests_html Approach:

from requests_html import HTMLSession   
import json
 
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
session = HTMLSession()
r = session.get(url)
r.html.render(timeout=20)
 
json_text=r.html.xpath("//script[@type='application/ld+json']/text()")[0][:-1]
json_data = json.loads(json_text)
print(json_data['name']['price'])

Beautiful Soup Approach:

import sys
import time
from bs4 import BeautifulSoup
import requests
import re

url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'

headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
page=requests.get(url, headers=headers)
        
time.sleep(2)
soup=BeautifulSoup(page.text,'html.parser')

linkitem=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)

linkprice=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 sc-13n2dsm-5 kxEbZl deQJPo'})
print(linkprice)

Selenium Approach:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571"

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)

        
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')

linkitem = soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)

Answer:

That approach of yours with the embedded JSON needs some refinement. In other words, you're almost there. Also, this can be done with pure requests and bs4.

PS. I'm using different URLs, as the one you gave returns a 404.

Here's how:

import json

import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.fairprice.com.sg/product/11798142",
    "https://www.fairprice.com.sg/product/vina-maipo-cabernet-sauvignon-merlot-750ml-11690254",
    "https://www.fairprice.com.sg/product/new-moon-new-zealand-abalone-425g-75342",
]

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}

for url in urls:
    product_data = (
        json.loads(
            BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
            .find("script", type="application/ld+json")
            .string[:-1]
        )
    )

    print(product_data["name"])
    print(product_data["offers"]["price"])

This should output:

Nongshim Instant Cup Noodle - Spicy
1.35
Vina Maipo Red Wine - Cabernet Sauvignon Merlot
14.95
New Moon New Zealand Abalone
33.8
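
The loop above assumes every page contains the script tag and that its text ends with exactly one extra character (hence the [:-1]). As a rough sketch of a more forgiving variant, the helper below returns None when the tag is missing and lets json.JSONDecoder.raw_decode ignore whatever follows the first JSON value. The function name extract_product and the sample markup are hypothetical, made up for illustration, and the sketch assumes the embedded JSON is a single object with "name" and "offers" keys, as in the output shown above.

```python
import json

from bs4 import BeautifulSoup


def extract_product(html):
    """Return (name, price) from the first JSON-LD script in html, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", type="application/ld+json")
    if tag is None or not tag.string:
        return None
    # raw_decode parses the first JSON value and ignores anything after it,
    # so trailing characters do not need to be sliced off by hand.
    data, _ = json.JSONDecoder().raw_decode(tag.string.strip())
    # Assumption: the top-level JSON value is an object (dict).
    offers = data.get("offers") or {}
    return data.get("name"), offers.get("price")


# Hypothetical sample markup, not taken from the real site.
sample_html = """
<html><head>
<script type="application/ld+json">{"name": "Sample Product", "offers": {"price": 1.35}};</script>
</head></html>
"""

print(extract_product(sample_html))
```

To use it with a live page, pass in requests.get(url, headers=headers).text and check for None before unpacking the result.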