Scraping with BeautifulSoup to CSV

This code does not crash when I run it. The output file flyingmag.csv is populated, but not in the way I want. I want to add div.elementor-widget-container > h3 so that both the airplane manufacturer and the airplane model are included in the output. I also want the records to be in a traditional Excel row format, and I want to scrape all aircraft manufacturers and models.

import requests, csv
from bs4 import BeautifulSoup
from urllib.request import Request

url = 'https://www.flyingmag.com/2019-buyers-single-engine-piston/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

with open('flyingmag.csv', "w", encoding="utf-8-sig") as f:
    writer = csv.writer(f)    
    writer.writerow(['Base_Price','Typically_Equipped_Price','Engine','Horsepower','Propeller','Seats','Length','Height','Wingspan','Wing_Area','Wing_Loading','Power_Loading','Max_Takeoff_Weight','Empty_Weight','Useful_Load','Fuel_Capacity','Max_Operating_Altitude','Max_Rate_of_Climb','Max_Cruise_Speed','Normal_Cruise_Speed','Never_Exceed_Speed','Stall_Speed-Flaps_Up','Stall_Speed-Landing_Configuration','Max_Range','Takeoff_Roll','Takeoff_Distance_Over_50_ft.','Landing_Roll','Landing_Distance_Over_50_ft'])

    while True:
        html = requests.get(url , headers = headers)
        soup = BeautifulSoup(html.text, 'html.parser')
      
        for row in soup.select('table tbody tr'):
            writer.writerow([c.text if c.text else '' for c in row.select('td')])
            print(row)
        else:
            break

CodePudding user response:

You can first work out the number of overarching "sections" (or listings, as I call them) by locating the h3 headers, which I do with section:has([data-widget_type="heading.default"]). Then loop over those and extract the manufacturer. Use find_next to move to the following sections, which contain the model and the table. All the data appears to be present on that single page if you scroll down to the bottom.
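
As a minimal sketch of that idea, the snippet below runs the same selector and find_next calls against a small, made-up HTML fragment. The markup and the aircraft names are assumptions for illustration only and are much simpler than the real page; html.parser is used here just to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup (not copied from the real page)
html = """
<section><div data-widget_type="heading.default"><h2>Cessna</h2></div></section>
<section><h3>172 Skyhawk</h3></section>
<section><div data-widget_type="heading.default"><h2>Piper</h2></div></section>
<section><h3>Archer</h3></section>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the sections that contain a heading widget
listings = soup.select('section:has([data-widget_type="heading.default"])')

pairs = []
for listing in listings:
    manufacturer = listing.select_one("h2").get_text(strip=True)
    # find_next walks forward in document order to the next h3
    model = listing.find_next("h3").get_text(strip=True)
    pairs.append((manufacturer, model))

print(pairs)
```

Each heading section is paired with the first h3 that follows it, which is how the manufacturer and model end up on the same row.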

With respect to headers:

td:not([colspan]) strong

The :not([colspan]) is used to exclude the last row (Back to Top) of each table for each listing. This is a "merged cell" with a colspan attribute and doesn't contain data you want. You could also have used an nth-child range selector. The first (leftmost as you view the page) and third table columns are used for the headers, and I access these only for the first listing. I checked beforehand that the same headers were present in all tables. The space followed by strong is a descendant combinator that selects the strong elements, which are present in the 1st and 3rd td children of each row of the tables.
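
To illustrate, here is a small sketch of the header selector applied to a made-up two-row table with a merged Back to Top row. The table contents are placeholders, not real data from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical table: label / value / label / value, plus a merged last row
html = """
<table><tbody>
<tr><td><strong>Base Price</strong></td><td>100</td><td><strong>Engine</strong></td><td>Lycoming</td></tr>
<tr><td><strong>Seats</strong></td><td>4</td><td><strong>Length</strong></td><td>27 ft</td></tr>
<tr><td colspan="4"><strong>Back to Top</strong></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# strong elements inside any td that has no colspan attribute
headers = [s.get_text(strip=True) for s in soup.select("td:not([colspan]) strong")]

print(headers)  # ['Base Price', 'Engine', 'Seats', 'Length']
```

The Back to Top text is also wrapped in strong in this sketch, but it is skipped because its td carries a colspan attribute.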

With respect to row values in csv after headers:

td:not([colspan]):nth-child(even)

The first part is as per the headers explanation. However, instead of adding a descendant combinator with a strong type selector, I simply used nth-child(even). This selects the 2nd and 4th columns as desired, since these are the even-numbered children.
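
The same kind of made-up table can be used to sketch the value selector. Again, the cell contents are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table: label / value / label / value, plus a merged last row
html = """
<table><tbody>
<tr><td><strong>Base Price</strong></td><td>100</td><td><strong>Engine</strong></td><td>Lycoming</td></tr>
<tr><td><strong>Seats</strong></td><td>4</td><td><strong>Length</strong></td><td>27 ft</td></tr>
<tr><td colspan="4"><strong>Back to Top</strong></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Even-numbered td children (2nd and 4th) that have no colspan attribute
values = [td.get_text(strip=True)
          for td in soup.select("td:not([colspan]):nth-child(even)")]

print(values)  # ['100', 'Lycoming', '4', '27 ft']
```

The values come back in the same order as the header labels, so the two lists line up column by column when written to the CSV.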

import requests, csv
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.flyingmag.com/2019-buyers-single-engine-piston')
soup = bs(r.content, 'lxml')
listings = soup.select('section:has([data-widget_type="heading.default"])')

with open('flyingmag.csv', "w", encoding="utf-8-sig", newline='') as f:
    
    writer = csv.writer(f, delimiter = ",", quoting=csv.QUOTE_MINIMAL)    
    
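    for num, listing in enumerate(listings):
        # Manufacturer comes from the heading widget inside this section
        manufacturer = listing.select_one('[data-widget_type="heading.default"] h2').text
        # The model name and the spec table are in the elements that follow
        model = listing.find_next('h3').text
        table = listing.find_next('table')

        if num == 0:
            # Write the header row once, using the labels from the first table
            row = ['Manufacturer', 'Model']
            row.extend([i.text for i in table.select('td:not([colspan]) strong')])
            writer.writerow(row)

        # Even-numbered cells (2nd and 4th columns) hold the values
        values = [i.text for i in table.select('td:not([colspan]):nth-child(even)')]
        row = [manufacturer, model]
        row.extend(values)
        writer.writerow(row)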