Web Scraping Python - Pubs

I am trying to extract the site name and address from each card on this website, but my code doesn't seem to work. Any suggestions?

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://order.marstons.co.uk/")

all_cards = driver.find_elements_by_xpath("//div[@class='h3.body__heading']/div[1]")
for card in all_cards:
    print(card.text)  # do as you will

CodePudding user response：

I use Firefox, but this should also work with Chrome.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# driver = webdriver.Chrome(ChromeDriverManager().install())
driver = webdriver.Firefox()
driver.get("https://order.marstons.co.uk/")
try:
    container = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="app"]/div/div/div/div[2]/div'))
    )
    # recent Selenium 4 releases dropped the find_element(s)_by_* helpers; use find_element(s)(By..., ...) instead
    for el in container.find_elements(By.TAG_NAME, 'a'):
        print("heading", el.find_element(By.TAG_NAME, 'h3').text)
        print("address", el.find_element(By.TAG_NAME, 'p').text)
finally:
    driver.quit()

CodePudding user response：

I'm glad that you are trying to help yourself. It seems you are new to this, so let me offer some help.

Automating a browser with Selenium for this is going to take you forever. The Marston's site is pretty straightforward to scrape if you know where to look. Open your browser's Developer Tools (F12 on a PC), go to the Network tab, select the Fetch/XHR filter, and refresh the Marston's page: you'll see some backend API calls happening. Click on the one called "brand", then on the "Preview" tab, and you'll see a collapsible tree of all sorts of information. That is JSON, which maps directly onto Python lists and dictionaries, so it's easy to get at the data you are after. The information in the "venues" list is going to be helpful when it comes to scraping the menus for each venue.
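To make the "lists and dictionaries" point concrete, here is a rough sketch. The sample payload is made up and heavily trimmed; only the key names (venues, slug, name, address, city) are taken from the code further down, and the real response has many more fields.

```python
import json

# Hypothetical, trimmed stand-in for the "brand" response.
# Key names mirror the ones used in the scraper below; the values are invented.
sample_text = '''
{
  "venues": [
    {"slug": "example-pub", "name": "Example Pub", "address": {"city": "Exampletown"}},
    {"slug": "another-pub", "name": "Another Pub", "address": {"city": "Sampleville"}}
  ]
}
'''

# A JSON object becomes a dict, a JSON array becomes a list
data = json.loads(sample_text)

# Step through it like any other nested dict/list structure
names_by_slug = {v["slug"]: v["name"] for v in data["venues"]}
cities = [v["address"]["city"] for v in data["venues"]]

print(names_by_slug)
print(cities)
```

With requests you don't need json.loads yourself: calling .json() on the response gives you the same kind of nested structure.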

When you open a specific pub, you'll see an API call containing that pub's name. Its response has all the menu info, which you can inspect in the same way, and we can call these venue APIs ourselves using the "slug" values from the venues response above.
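As a small sketch of how a slug turns into a venue URL: the host and the /venues/ path are the ones used in the code further down, while the helper name venue_url and the slug "example-pub" are made up for illustration. Percent-encoding the slug is my own precaution, not something the site is known to require.

```python
from urllib.parse import quote

# Host taken from the scraper code in this answer
BASE_URL = "https://api-cdn.orderbee.co.uk"


def venue_url(slug: str) -> str:
    """Build the per-venue endpoint from a venue's slug (hypothetical helper)."""
    # quote() percent-encodes anything that isn't URL-safe; plain slugs pass through unchanged
    return f"{BASE_URL}/venues/{quote(slug, safe='')}"


# "example-pub" is an invented slug, not a real venue
print(venue_url("example-pub"))
```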

So by making our own requests to these URLs and stepping through the JSON to pick out the data we want, we can have everything done in a couple of minutes, which is far easier than automating a browser! I've written the code below; you'll need to pip install requests and pandas to make it work. Feel free to ask questions if anything is unclear. You owe me a pint! :) Cheers

import requests
import pandas as pd

headers = {'origin':'https://order.marstons.co.uk'}
url = 'https://api-cdn.orderbee.co.uk/brand'
resp = requests.get(url,headers=headers).json()

venues = {}
for venue in resp['venues']:
    venues[venue['slug']] = venue

print(f'{len(venues)} venues to scrape')

output = []
for venue in venues.keys():
    try:
        url = f'https://api-cdn.orderbee.co.uk/venues/{venue}'
        print(f'Scraping: {venues[venue]["name"]}')
        try:
            info = requests.get(url,headers=headers).json()
        except Exception as e:
            print(e)
            print(f'{venues[venue]["name"]} not available')
            continue

        for category in info['menus']['oat']['categories']: #oat = order at table?
            cat_name = category['name']
            for subcat in category['subCategories']:
                subcat_name = subcat['name']
                for item in subcat['items']:

                    # one flat row per menu item; named `row` so it doesn't shadow the venue response `info`
                    row = {
                        'venue_name': venues[venue]['name'],
                        'venue_city': venues[venue]['address']['city'],
                        'venue_address': venues[venue]['address']['streetAddress'],
                        'venue_postcode': venues[venue]['address']['postCode'],
                        'venue_latlng': venues[venue]['address']['location']['coordinates'],
                        'category': cat_name,
                        'subcat': subcat_name,
                        'item_name': item['name'],
                        'item_price': item['price'],
                        'item_id': item['id'],
                        'item_sku': item['sku'],
                        'item_in_stock': item['inStock'],
                        'item_active': item['isActive'],
                        'item_last_update': item['updatedAt'],
                        'item_diet': item['diet']
                        }

                    output.append(row)
    except Exception as e:
        # probably hit when the expected menu keys are missing (no menu available? closed location?)
        print(f'Problem scraping {venues[venue]["name"]}, skipping it: {e!r}')
        continue

df = pd.DataFrame(output)
df.to_csv('marstons_dump.csv',index=False)
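If you'd rather not rely on the broad try/except for venues without a menu, one option is to pull the flattening into a small helper that simply yields nothing when keys are missing. This is only a sketch: iter_menu_rows is a hypothetical name, the sample data is invented (including the price format), and I'm assuming a missing menu shows up as absent or empty keys, which I haven't verified against the API.

```python
def iter_menu_rows(venue, info):
    """Yield one flat dict per menu item; yield nothing if the menu is missing."""
    # `or {}` / `or []` also covers keys that are present but set to None
    menus = info.get("menus") or {}
    oat = menus.get("oat") or {}
    for category in oat.get("categories") or []:
        for subcat in category.get("subCategories") or []:
            for item in subcat.get("items") or []:
                yield {
                    "venue_name": venue["name"],
                    "category": category["name"],
                    "subcat": subcat["name"],
                    "item_name": item["name"],
                    "item_price": item.get("price"),
                }


# Invented sample data shaped like the keys used in the scraper above
venue = {"name": "Example Pub"}
info_with_menu = {
    "menus": {"oat": {"categories": [
        {"name": "Drinks", "subCategories": [
            {"name": "Ales", "items": [{"name": "Example Ale", "price": 4.5}]}
        ]}
    ]}}
}
info_without_menu = {"menus": {}}

rows = list(iter_menu_rows(venue, info_with_menu))
empty = list(iter_menu_rows(venue, info_without_menu))
print(rows)
print(empty)
```

In the main loop you could then do output.extend(iter_menu_rows(venues[venue], info)) and keep the try/except only around the network call.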