I'm scraping data from a website. The data is available in a script tag, but I'm unable to extract it into the proper layout. Here is the raw data from the script tag:
{
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "I Got Toddler Problems Tee",
    "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136",
    "sku": "BMRSUQNGGS",
    "image": [
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsmauv.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsltgray.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsblk.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemspk.png",
    ],
    "description": "BMRSUQNGGS",
    "brand": {"@type": "Thing", "name": "InspireUplift"},
    "aggregateRating": {"@type": "AggregateRating", "ratingValue": 0, "reviewCount": 0},
    "offers": {
        "@type": "AggregateOffer",
        "highPrice": 32.97,
        "lowPrice": 29.97,
        "offerCount": 24,
        "priceCurrency": "USD",
        "offers": [
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37621",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-1",
                "alternateName": "I Got Toddler Problems Tee - Mauve/S",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37622",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-2",
                "alternateName": "I Got Toddler Problems Tee - Mauve/M",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37623",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-3",
                "alternateName": "I Got Toddler Problems Tee - Mauve/L",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
        ],
        "shippingDetails": {
            "@type": "OfferShippingDetails",
            "shippingRate": {
                "@type": "MonetaryAmount",
                "value": "0",
                "currency": "USD",
            },
        },
    },
}
I want to extract every variant's name, image URL, size, and color, together with the variant URL, and get the data back in this layout. Can anyone please help? I'm learning Python. Here is my code:
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# link and headers are assumed to be defined earlier in the script
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
scripts = soup.find('script', type='application/ld+json').string
data = json.loads(scripts)
image = data["image"]  # the full list of image URLs for the product
newlist = []
try:
    altname = data["offers"]["offers"]
except KeyError:
    print("not found")
    altname = []
for item in altname:
    area = item["alternateName"]
    detail = {"image": image, "name": area}
    print(detail)
    newlist.append(detail)
print("saving")
df = pd.DataFrame(newlist)
df.to_csv("first_list.csv")
This is what I get back: all the image URLs end up in one cell instead of each one appearing next to its variant color.
CodePudding user response:
This solution is based on one JSON document (one product). Both uploaded screenshots are the same. It is safer to use data.get('key') than data['key'], because .get returns None for a missing key instead of raising KeyError.
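To illustrate the difference between the two lookups, here is a minimal sketch. The `product` dict is a made-up fragment shaped like the data in the question, with the nested "offers" list deliberately left out.

```python
# Hypothetical fragment of the product data, missing the nested "offers" list
product = {"name": "I Got Toddler Problems Tee", "offers": {"offerCount": 24}}

# Square-bracket indexing raises KeyError when a key is absent
try:
    product["offers"]["offers"]
    missing_raises = False
except KeyError:
    missing_raises = True

# .get returns None when the key is absent and no default is given
no_default = product.get("brand")

# Supplying defaults lets nested lookups be chained safely
offers = product.get("offers", {}).get("offers", [])

print(missing_raises, no_default, offers)
```

Passing `{}` and `[]` as defaults means a later `for item in offers:` loop simply does nothing when the data is missing.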
[data.get("name")] + [""] * (len(offer) - 1) creates a column of the same length as the others. Without it, creating the data frame raises an error, because the product name appears only once, in the first cell.
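The padding idea can be sketched in plain Python. The helper name `pad_first_only` is hypothetical, and the color and size values are taken from the sample offers in the question. pandas raises a ValueError when the lists in a dict of columns differ in length, which is what the padding avoids.

```python
def pad_first_only(value, length, filler=""):
    """Return a list of `length` items: `value` first, then `filler`."""
    if length <= 0:
        return []
    return [value] + [filler] * (length - 1)


colors = ["Mauve", "Mauve", "Mauve"]
sizes = ["S", "M", "L"]
product_name = pad_first_only("I Got Toddler Problems Tee", len(colors))

columns = {
    "product_name": product_name,
    "variant_color_name": colors,
    "variant_size": sizes,
}

# Every column now has the same number of entries
lengths = {len(values) for values in columns.values()}
print(product_name, lengths)
```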
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# link and headers are assumed to be defined as in the question
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
scripts = soup.find('script', type='application/ld+json').string
# parse the JSON text of the script tag into a dict
data = json.loads(scripts)

size, color, url = [], [], []
# fall back to empty containers so a missing key does not raise
offer = data.get("offers", {}).get("offers", [])
for item in offer:
    # "I Got Toddler Problems Tee - Mauve/S" -> ["Mauve", "S"]
    size_color_list = item["alternateName"].split(" - ")[1].split("/")
    url.append(item["url"])
    color.append(size_color_list[0])
    size.append(size_color_list[1])

# product name in the first row only, padded so all columns have equal length
product_name = [data.get("name")] + [""] * (len(offer) - 1) if offer else []
detail = {
    "product_name": product_name,
    "variant_color_name": color,
    "variant_size": size,
    "variant_image": url,  # note: this holds the variant page URL
}
df = pd.DataFrame(detail)
df.index += 1  # number the rows from 1 instead of 0
# df.to_csv('first_list.csv')
df.to_excel("first_list.xlsx")
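
The code above stores the variant page URL in the variant_image column. If the goal is one row per variant with a matching image, one possible sketch follows. It rests on an assumption that the source data does not confirm: that the n-th distinct color among the offers corresponds to the n-th entry of the "image" list. The function name `variant_rows`, the "Light Gray" color, and the shortened file names and URLs in `sample` are hypothetical.

```python
def variant_rows(data):
    """Build one dict per offer, pairing each color with an image."""
    images = data.get("image", [])
    offers = data.get("offers", {}).get("offers", [])
    color_to_image = {}
    rows = []
    for item in offers:
        # "I Got Toddler Problems Tee - Mauve/S" -> "Mauve/S"
        variant = item.get("alternateName", "").rpartition(" - ")[2]
        # "Mauve/S" -> ("Mauve", "/", "S")
        color, _, size = variant.partition("/")
        if color not in color_to_image:
            # ASSUMPTION: image order matches the order in which colors first appear
            idx = len(color_to_image)
            color_to_image[color] = images[idx] if idx < len(images) else ""
        rows.append({
            "product_name": data.get("name", ""),
            "variant_color_name": color,
            "variant_size": size,
            "variant_url": item.get("url", ""),
            "variant_image": color_to_image[color],
        })
    return rows


# Hypothetical, shortened sample shaped like the data in the question
sample = {
    "name": "I Got Toddler Problems Tee",
    "image": ["mauve.png", "ltgray.png"],
    "offers": {"offers": [
        {"alternateName": "I Got Toddler Problems Tee - Mauve/S", "url": "u1"},
        {"alternateName": "I Got Toddler Problems Tee - Mauve/M", "url": "u2"},
        {"alternateName": "I Got Toddler Problems Tee - Light Gray/S", "url": "u3"},
    ]},
}

rows = variant_rows(sample)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to pd.DataFrame(rows), which produces one row per variant, so no padding is needed. If the image order on the real page does not follow the color order, the pairing will be wrong and should be checked by hand.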
