This is my first question, so I hope I am asking it in the right place and that the question is appropriate.

I am using Python and Selenium to collect data from this website (https://www.sqdc.ca). I am able to scrape the homepage and collect a list of the main product categories. I am also able to go into each category's pages and collect the information listed there for each product (for example here: https://www.sqdc.ca/en-CA/dried-cannabis?fn1=InStock&fv1=in store|online&origin=dropdown&c1=products&c2=dried-cannabis&clickedon=dried-cannabis). I have also managed to get the URLs of all the products, so that I can collect more detail on each one.

I have been stuck on this last step for some time now. When I go to a product's page to get more detail (for example here: https://www.sqdc.ca/en-CA/p-apples-cream/671148904118-P/671148904118), I am unable to find the section with the list of stores showing availability and inventory, even though it loads immediately in my browser.

When I look at the page source in the browser, this is the section that I am after:

<div id="storesList" >
<div data-templateid="StoreInventoryList">
<p >Unavailable</p>
</div>

I have no idea why it says "Unavailable". Ideally I would like to get that list, and click on "see more stores" until all the stores have loaded.

I have tried waiting, but that did not work, and in any case it seems that the list is already loaded when I land on the page.

Any thoughts? I know the list is generated by JavaScript, since when I inspect the page in my browser I see a class called row-js-equalize.

The code:

# Setting up the driver and options
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('start-maximized')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('headless')
options.add_argument('no-sandbox')
options.add_argument("window-size=1200,600")  # Chrome expects width,height separated by a comma
# Selenium 3 style call; Selenium 4 passes the driver path through a Service object instead
driver = webdriver.Chrome("/home/amr/Downloads/chromedriver", options=options)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))

# Getting the page and parsing it

driver.get('https://www.sqdc.ca/en-CA/p-cbd-decarb/628634303078-P/628634303078')
content = driver.page_source
soup = BeautifulSoup(content, "lxml")

If you go to the URL, the section at the bottom with the stores and inventory is what I am after. I cannot find it in the parsed HTML.

Answer:

You don't need Selenium to get the inventory. In your browser you can find the backend API call to the inventory endpoint: https://www.sqdc.ca/api/olivestoreinventory/getstoresinventory

To find it, open your browser's Developer Tools, go to the Network tab, filter on Fetch/XHR and refresh the page. All the details you want are loaded from various backend API calls. We can recreate the inventory call like this:

import requests

headers = {
    'accept-language': 'en-CA',  # important to keep this header, for some reason
    'x-requested-with': 'XMLHttpRequest',  # important to keep this header
}

url = 'https://www.sqdc.ca/api/olivestoreinventory/getstoresinventory'
# Pagesize is basically the number of stores to return; 1000 gets all stores. The SKU comes from the product URL.
payload = {"Sku": "671148904118", "Page": 1, "Pagesize": 1000}

resp = requests.post(url, headers=headers, json=payload).json()
print(len(resp['Stores']))

# store name : inventory count
inventory = {x['Name']: x['InventoryStatus']['Quantity'] for x in resp['Stores']}
print(inventory)

I've parsed the JSON response and created an inventory dict that maps each store name to its inventory level, covering 82 stores for this SKU. You can recreate this for any product as long as you send its SKU number in the payload.
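
Since you already have the product URLs, here is a rough sketch of how the call above could be wrapped so it works per product. It assumes the SKU is always the last path segment of the product URL (true for the two example URLs in the question, but not verified for the whole site) and that the response always has the Stores / Name / InventoryStatus / Quantity shape used above. The helper names (sku_from_url, build_payload, parse_inventory, fetch_inventory) are made up for illustration.

```python
from urllib.parse import urlparse

INVENTORY_URL = "https://www.sqdc.ca/api/olivestoreinventory/getstoresinventory"
HEADERS = {
    "accept-language": "en-CA",
    "x-requested-with": "XMLHttpRequest",
}


def sku_from_url(product_url):
    """Return the last path segment of a product URL (assumed to be the SKU)."""
    path = urlparse(product_url).path
    return path.rstrip("/").split("/")[-1]


def build_payload(sku, page_size=1000):
    """Build the JSON body for the inventory call, same keys as in the answer."""
    return {"Sku": sku, "Page": 1, "Pagesize": page_size}


def parse_inventory(resp):
    """Map store name -> quantity from the decoded JSON response."""
    return {
        store["Name"]: store["InventoryStatus"]["Quantity"]
        for store in resp.get("Stores", [])
    }


def fetch_inventory(product_url):
    """POST to the inventory endpoint for one product URL and parse the result."""
    # Third-party dependency, imported here so the helpers above work without it.
    import requests

    payload = build_payload(sku_from_url(product_url))
    resp = requests.post(INVENTORY_URL, headers=HEADERS, json=payload, timeout=30)
    resp.raise_for_status()
    return parse_inventory(resp.json())
```

You would then call something like fetch_inventory('https://www.sqdc.ca/en-CA/p-apples-cream/671148904118-P/671148904118') for each URL in your list, ideally with a short pause between requests.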