I'm scraping a website's HTML after a GET request. The page defines one JavaScript variable per product, named product followed by the product ID (for example, product1218181), and I want to extract the data it holds. I'm using Beautiful Soup since it's what I usually use, but I can't figure out how to get a JavaScript variable out of the HTML. The HTML looks like this:
<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>
I would like the scraped output to look like this:
name: XIAOMI Poco X3 Pro 256 GB Akıllı Telefon
id: 1218181
price: 5799.00
brand: XIAOMI
Update
Here is my full code. I would like to scrape the product information from this website:
import requests
import re, json
from bs4 import BeautifulSoup
URL = "https://www.mediamarkt.com.tr/tr/category/_cep-telefonları-504171.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="category")
test = '<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>'
pattern = re.compile('.*?var product1218181 = (.*?);.*?')
match = pattern.match(test)
if match is not None:
    data = json.loads(match.groups()[0])
    for key, value in data.items():
        print(key, ":", value)
CodePudding user response:
You can use a regular expression (the re module) to extract the assignment and then pass the captured text to json.loads() to parse the JSON value into a dict.
Here is a sample snippet:
import re, json
test = '<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>'
# \d+ matches the numeric product ID that follows "product"
pattern = re.compile(r'.*?var product\d+ = (.*?);.*?')
match = pattern.match(test)
if match is not None:
    data = json.loads(match.groups()[0])
    for key, value in data.items():
        print(key, ":", value)
output:
name : XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz
id : 1218181
price : 5799.00
brand : XIAOMI
ean : 6934177738371
dimension25 : InStock
dimension26 : 11.9
dimension24 : 18.0
category : Telefon
dimension9 : Cep Telefonları
dimension10 : Android Telefonlar
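If only the four fields listed in the question are needed, the parsed dict can be filtered before printing. The following is a minimal sketch, assuming the variable name is always product followed by digits and the object ends with "};"; the names wanted and selected are illustrative and not part of the original code.

```python
import json
import re

html = '<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>'

# Assumption: the variable is named "product" plus digits and the object ends with "};".
match = re.search(r'var product\d+ = (\{.*?\});', html)

wanted = ("name", "id", "price", "brand")  # the fields requested in the question
selected = {}
if match is not None:
    data = json.loads(match.group(1))
    # Keep only the requested keys, skipping any that are missing.
    selected = {key: data[key] for key in wanted if key in data}
    for key, value in selected.items():
        print(f"{key}: {value}")
```

Filtering with "if key in data" avoids a KeyError for products that lack one of the fields.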
CodePudding user response:
You can select the variable in the text of your requests.get() response with a regex and load the captured string with json.loads():
m = re.search(r'var product\d+ = ({.*})', page.text)
json.loads(m.group(1))
Example that builds a list of dicts, one per product variable:
import requests
import re, json
from bs4 import BeautifulSoup
URL = "https://www.mediamarkt.com.tr/tr/category/_cep-telefonları-504171.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
data = [json.loads(m.group(1)) for m in re.finditer(r'var product\d+ = ({.*})', page.text)]
Output
[{'name': 'XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz', 'id': '1218181', 'price': '5799.00', 'brand': 'XIAOMI', 'ean': '6934177738371', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'APPLE iPhone 12 64GB Akıllı Telefon Yeşil', 'id': '1212811', 'price': '14749.00', 'brand': 'APPLE', 'ean': '0194252030943', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'SAMSUNG Galaxy A22 128 GB Akıllı Telefon Beyaz', 'id': '1217491', 'price': '3499.00', 'brand': 'SAMSUNG', 'ean': '8806092288300', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar', 'dimension11': 'Samsung Telefon'}, {'name': 'XIAOMI Redmi 9T 128 GB Akıllı Telefon Yeşil', 'id': '1216309', 'price': '3399.00', 'brand': 'XIAOMI', 'ean': '6934177746031', 'dimension25': 'OutOfStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'APPLE iPhone 12 128GB Akıllı Telefon Siyah', 'id': '1212812', 'price': '15699.00', 'brand': 'APPLE', 'ean': '0194252031285', 'dimension25': 'InStock', 'dimension26': 9.99, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'APPLE iPhone 11 64GB Akıllı Telefon Sarı', 'id': '1212830', 'price': '10349.00', 'brand': 'APPLE', 'ean': '0194252098264', 'dimension25': 'InStock', 'dimension26': 9.99, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'CASPER VIA F20 128 GB Akıllı Telefon Beyaz', 'id': '1216984', 'price': '2999.00', 'brand': 'CASPER', 'ean': '8699247212134', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android 
Telefonlar'}, {'name': 'VIVO Y53S 128 GB Akıllı Telefon Derin Mavi', 'id': '1217949', 'price': '4499.00', 'brand': 'VIVO', 'ean': '6935117836812', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'OPPO A74 128 GB Akıllı Telefon Gece Mavisi', 'id': '1215862', 'price': '4499.00', 'brand': 'OPPO', 'ean': '8683040000227', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'XIAOMI Redmi 9T 128 GB Akıllı Telefon Gri', 'id': '1216310', 'price': '3399.00', 'brand': 'XIAOMI', 'ean': '6934177746086', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'VIVO Y53S 128 GB Akıllı Telefon Gökkuşağı', 'id': '1218011', 'price': '4499.00', 'brand': 'VIVO', 'ean': '6935117836829', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'OPPO A55 64GB Akıllı Telefon Yıldızlı Siyah', 'id': '1218661', 'price': '3499.00', 'brand': 'OPPO', 'ean': '8683040000418', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar', 'dimension11': 'Oppo Telefon'}, {'name': 'OPPO A55 64GB Akıllı Telefon Gökkuşağı Mavisi', 'id': '1218660', 'price': '3499.00', 'brand': 'OPPO', 'ean': '8683040000425', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar', 'dimension11': 'Oppo Telefon'}, {'name': 'TCL 20 E 32 GB Akıllı Telefon Mavi', 'id': '1217712', 'price': '2399.00', 'brand': 'TCL', 'ean': '4894461894812', 'dimension25': 'InStock', 
'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'OPPO A74 128 GB Akıllı Telefon Prizma Siyahı', 'id': '1215856', 'price': '4499.00', 'brand': 'OPPO', 'ean': '8683040000210', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'APPLE iPhone 11 128GB Akıllı Telefon Mor', 'id': '1212837', 'price': '10849.00', 'brand': 'APPLE', 'ean': '0194252100431', 'dimension25': 'InStock', 'dimension26': 9.99, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'XIAOMI Redmi Note 10 S 128 GB Akıllı Telefon Beyaz', 'id': '1217380', 'price': '4999.00', 'brand': 'XIAOMI', 'ean': '6934177748431', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'CASPER VIA E4 32 GB Akıllı Telefon Siyah', 'id': '1216978', 'price': '2299.00', 'brand': 'CASPER', 'ean': '8699247209356', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'APPLE iPhone 13 Mini 128 GB Akıllı Telefon Starlight', 'id': '1217590', 'price': '14799.00', 'brand': 'APPLE', 'ean': '0194252689950', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'APPLE iPhone 13 Mini 256 GB Akıllı Telefon Starlight', 'id': '1217595', 'price': '16199.00', 'brand': 'APPLE', 'ean': '0194252691304', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}]
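A greedy {.*} capture relies on each variable sitting on its own line. As a hedged alternative, the sketch below lets json.JSONDecoder.raw_decode find where each object ends, so two assignments on one line or a semicolon inside a value do not confuse it. The page_text sample is made up for illustration and is not taken from the site; the variable naming (product plus digits) is the same assumption as above.

```python
import json
import re

# Made-up sample: two assignments on one line, one value containing ";".
page_text = (
    '<script>var product1 = {"name":"A; B","id":"1"}; '
    'var product2 = {"name":"C","id":"2"};</script>'
)

decoder = json.JSONDecoder()
products = []
for m in re.finditer(r'var product\d+ = ', page_text):
    try:
        # raw_decode parses one JSON value starting at the given index
        # and returns it together with the index where parsing stopped.
        obj, _end = decoder.raw_decode(page_text, m.end())
    except json.JSONDecodeError:
        continue  # not valid JSON after this assignment; skip it
    products.append(obj)

print(products)
```

The regex only locates the start of each assignment; the JSON parser decides where the object ends.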
CodePudding user response:
I wrote this script to parse the JSON inside that script tag, using the json library along with BeautifulSoup.
First, I loop through all the script tags on the page (in case there are several and they have no id or class to select by) and pick the one we need, the one that contains "name" (you can make this check more precise).
Then, with simple string operations, I extract the JSON text and load it as a dictionary.
from bs4 import BeautifulSoup
import json
html = '''<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>'''
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all('script'):
    if '= {"name":' in item.text:
        # keep everything after the first ' = ' and drop the trailing ';'
        dictionary = item.text.split(' = ', 1)[-1][:-1]
        jsonResponse = json.loads(dictionary)
        print(jsonResponse)
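The slice [:-1] assumes the script text ends exactly with ";". A slightly more forgiving sketch of the same string-based idea is shown below; the helper name parse_assignment is hypothetical, and it assumes the script holds a single assignment of the form "var name = {...};".

```python
import json


def parse_assignment(script_text):
    """Return the JSON value assigned in 'var name = ...;', or None if parsing fails."""
    _, sep, right = script_text.partition(" = ")
    if not sep:
        return None  # no assignment found in this script
    # Drop surrounding whitespace and an optional trailing semicolon.
    payload = right.strip().rstrip(";").rstrip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


script_text = 'var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI"};\n'
product = parse_assignment(script_text)
print(product)
```

Returning None instead of raising lets the caller loop over every script tag and skip the ones that do not hold product data.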
CodePudding user response:
Try this:
import requests
import re, json
from bs4 import BeautifulSoup
URL = "https://www.mediamarkt.com.tr/tr/category/_cep-telefonları-504171.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="category").find("script").text
data = json.loads(re.findall("(?:{).*(?:})", results)[0])
print(data)
