Parsing Script tag raw data into csv with python

Time:01-11

Basically, I'm scraping data from a website where it is available in a script tag, but I'm unable to extract it into a proper layout. Here is the raw data from the script tag:

{
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "I Got Toddler Problems Tee",
    "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136",
    "sku": "BMRSUQNGGS",
    "image": [
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsmauv.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsltgray.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsblk.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemspk.png",
    ],
    "description": "BMRSUQNGGS",
    "brand": {"@type": "Thing", "name": "InspireUplift"},
    "aggregateRating": {"@type": "AggregateRating", "ratingValue": 0, "reviewCount": 0},
    "offers": {
        "@type": "AggregateOffer",
        "highPrice": 32.97,
        "lowPrice": 29.97,
        "offerCount": 24,
        "priceCurrency": "USD",
        "offers": [
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37621",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-1",
                "alternateName": "I Got Toddler Problems Tee - Mauve/S",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37622",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-2",
                "alternateName": "I Got Toddler Problems Tee - Mauve/M",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37623",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-3",
                "alternateName": "I Got Toddler Problems Tee - Mauve/L",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
        ],
        "shippingDetails": {
            "@type": "OfferShippingDetails",
            "shippingRate": {
                "@type": "MonetaryAmount",
                "value": "0",
                "currency": "USD",
            },
        },
    },
}

I want to extract every variant's name, image URL, size, and color by going through the variant URLs, and get the data back in this layout. Can anyone please help? I'm learning Python. Here is my code:

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# link and headers are defined elsewhere
newlist = []
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
scripts = soup.find('script', type='application/ld+json').string
data = json.loads(scripts)
image = data["image"]
try:
    altname = data["offers"]["offers"]
except KeyError:
    print("not found")
    altname = []
for item in altname:
    area = item["alternateName"]
    # the full image list is attached to every variant row
    detail = {"image": image, "name": area}
    print(detail)
    newlist.append(detail)
    print("saving")
df = pd.DataFrame(newlist)
df.to_csv("first_list.csv")

This is what I get back: all the image URLs end up in one cell instead of each one sitting next to its variant color.

CodePudding user response:

The solution is based on a single JSON document (one product). Both uploaded screenshots are the same. It is better to use data.get('key') instead of data['key'], because get returns None for a missing key instead of raising KeyError.
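
As a rough illustration of that advice, the sketch below reads the nested offers list without raising when a level is missing. The helper name `variant_offers` is hypothetical, and the dictionary is a trimmed stand-in for the parsed script tag, not the real page.

```python
# Trimmed stand-in for the parsed JSON-LD; "offers" is deliberately absent.
data = {"name": "I Got Toddler Problems Tee"}


def variant_offers(product):
    """Return the nested offers list, or [] when any level is missing."""
    # `or {}` / `or []` also cover keys that are present but set to None.
    return (product.get("offers") or {}).get("offers") or []


# data["offers"]["offers"] would raise KeyError here; the helper does not.
missing = variant_offers(data)
present = variant_offers({"offers": {"offers": [{"sku": "BMRSUQNGGS-1"}]}})
```

With this shape the loop over variants can run unconditionally, since an empty list simply produces no rows.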

[data.get("name")] + [""] * (len(offer) - 1) pads the product-name column to the same length as the other columns. Without it, creating the data frame raises an error, because the product name appears only in the first cell.
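
The padding idea can be checked on its own, without pandas. This is a minimal sketch; `first_then_blank` is a hypothetical helper name, and the color list is sample data taken from the question.

```python
def first_then_blank(value, n):
    """Return a list of length n: `value` first, then empty strings."""
    if n <= 0:
        return []  # no variants means no rows at all
    return [value] + [""] * (n - 1)


colors = ["Mauve", "Mauve", "Mauve"]
names = first_then_blank("I Got Toddler Problems Tee", len(colors))
```

Guarding the `n <= 0` case matters because the bare expression yields a one-element list even when there are no variants, which would again leave the columns with unequal lengths.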

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# link and headers are defined as in the question
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
scripts = soup.find('script', type='application/ld+json').string
# .string is the raw JSON text, so json.loads turns it into a dict
data = json.loads(scripts)
size, color, url = [], [], []

# fall back to an empty dict/list so a missing key does not raise
offer = (data.get("offers") or {}).get("offers") or []

for item in offer:
    # alternateName looks like "I Got Toddler Problems Tee - Mauve/S"
    size_color_list = item["alternateName"].split(" - ")[1].split("/")
    url.append(item["url"])
    color.append(size_color_list[0])
    size.append(size_color_list[1])

# product name in the first row only, blanks after, so all columns match in length
product_name = [data.get("name")] + [""] * (len(offer) - 1) if offer else []

detail = {
    "product_name": product_name,
    "variant_color_name": color,
    "variant_size": size,
    "variant_image": url,  # holds the variant page URL
}

df = pd.DataFrame(detail)
df.index += 1  # start row numbering at 1 instead of 0
# df.to_csv('first_list.csv')
df.to_excel("first_list.xlsx")
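
Since the question asks for CSV, here is a standard-library-only sketch of the same row layout. It assumes every alternateName follows the "Title - Color/Size" pattern seen in the sample; `variant_rows`, `FIELDNAMES`, and the `variant_url` column name are hypothetical, and the embedded JSON is a trimmed two-variant copy of the data above.

```python
import csv
import io
import json

# Trimmed copy of the JSON-LD from the question (two variants only).
raw = """
{
  "name": "I Got Toddler Problems Tee",
  "offers": {
    "offers": [
      {"url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37621",
       "alternateName": "I Got Toddler Problems Tee - Mauve/S"},
      {"url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37622",
       "alternateName": "I Got Toddler Problems Tee - Mauve/M"}
    ]
  }
}
"""

FIELDNAMES = ["product_name", "variant_color_name", "variant_size", "variant_url"]


def variant_rows(product):
    """Yield one dict per variant, assuming names look like 'Title - Color/Size'."""
    offers = (product.get("offers") or {}).get("offers") or []
    for item in offers:
        # Keep only the part after the last " - ", e.g. "Mauve/S".
        _, _, variant = item.get("alternateName", "").rpartition(" - ")
        color, _, size = variant.partition("/")
        yield {
            "product_name": product.get("name", ""),
            "variant_color_name": color,
            "variant_size": size,
            "variant_url": item.get("url", ""),
        }


rows = list(variant_rows(json.loads(raw)))

# An in-memory buffer keeps the sketch side-effect free; for a real file,
# open("first_list.csv", "w", newline="") can be used in its place.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Unlike the data frame version, this repeats the product name on every row, which tends to be easier to filter and sort later. Blanking it after the first row is a small change if that layout is preferred.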