I need to scrape a website that presents product details in a table-like paragraph, and I want to load the result into a pandas DataFrame in Python.
I need the name, price, and description from each product page, all in one DataFrame. I can scrape each piece individually, but I can't combine them into a proper DataFrame.
Here is what I have done so far.
First I collect the product links, because I need to scrape multiple pages:
from bs4 import BeautifulSoup as soup
import requests

baseURL = 'https://www.civivi.com'
product_links = []
# HEADER is a dict of request headers defined earlier (not shown)
for x in range(1, 3):
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=HEADER)
    # HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")
    for items in knife_items:
        for links in items.find_all('a', attrs={'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            product_links.append(baseURL + links['href'])
And then I scrape the individual web pages here:
Name = []
Price = []
Specific = []
for links in product_links:
    # testlinks = "https://www.civivi.com/collections/all-products/products/civivi-rustic-gent-lockback-knife-c914-g10-d2"
    HTML2 = requests.get(links, headers=HEADER)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        for N in Booti2.find_all('h1', {'class': "product-meta__title heading h1"}):
            Name.append(N.text.replace('\n', '').strip())
        for P in Booti2.find_all('span', {'class': "price"}):
            Price.append(P.text.replace('\n', '').strip())
        Contents = Booti2.find('div', class_="rte text--pull")
        for S in Contents.find_all('span'):
            Specific.append(S.text)
    except AttributeError:
        # Contents is None when the description block is missing
        continue
So I need to get all the information in this format:
| Name           | Price | Model Number | Model Name | Overall Length |
|----------------|-------|--------------|------------|----------------|
| Product Name 1 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 2 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 3 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 4 | $$    | XXXX         | ABC        | XX"/XXcm       |
...and so on with rest of the columns from the web pages. Any help would be appreciated!! Thank you so much!!
CodePudding user response:
I'll update this in a minute but try something like this:
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import re

baseURL = 'https://www.civivi.com'
product_links = []
header = {}

for x in range(1, 2):
    # headers must be passed by keyword; the second positional argument is params
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=header)
    # HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")
    for items in knife_items:
        for links in items.find_all('a', attrs={
                'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            product_links.append(baseURL + links['href'])

# one single-row DataFrame per product is collected here
frames = []
for links in product_links:
    HTML2 = requests.get(links, headers=header)
    Booti2 = soup(HTML2.content, "lxml")
    try:
        buffer = pd.DataFrame(
            [[
                # name
                Booti2.find('h1', class_='product-meta__title heading h1').text.strip(),
                # price
                Booti2.find('div', class_='price-list').find('span').text,
                # if you don't want $ do this: Booti2.find('div', class_='price-list').find('span').text[1:]
                # Model Number
                str(Booti2(text=re.compile(r'(?:Model Number: )'))[4]).replace('Model Number: ', ''),
                # Model Name - using [4] is not the best way. I think the regex could be better or something.
                str(Booti2(text=re.compile(r'(?:Model Name: )'))[4]).replace('Model Name: ', ''),
                # Overall Length
                str(Booti2(text=re.compile(r'(?:Overall Length: )'))[4]).replace('Overall Length: ', '')
            ]],
            columns=['Name', 'Price', 'Model Number', 'Model Name', 'Overall Length']
        )
        frames.append(buffer)
    except (AttributeError, IndexError):
        # a tag was missing or fewer than 5 matches were found; skip this product
        continue

# dataframe that holds the final resulting data
# (DataFrame.append was removed in pandas 2.0, so the rows are combined with pd.concat)
final = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
EDIT: Answer to your question about [4]:
I wanted to find the tag that contains the relevant text, i.e. "Model Number", "Model Name", and "Overall Length", using regular expressions (the re library); that is what the text=re.compile(...) part does. Originally I tried something like:
Booti2.find_all('span', text=re.compile(r'Model Number')) # these attributes are in <span> tags
For some reason that didn't work correctly, so I changed it to find every string on the page that contains those words:
Booti2(text=re.compile(r'(?:Overall Length: )'))
The line above returns 5 matches; you can check this yourself by setting a breakpoint on that line. Index [4] simply picks the fifth (last) match, which happens to be the right text. I don't think it's the best solution, because it could easily break or return the wrong value on a page with a different number of matches.
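One possible reason the span-filtered search came back empty (an assumption, since the actual page markup isn't shown here) is that Beautiful Soup compares the pattern against each tag's .string, which is None whenever a tag contains more than one child node. The sketch below uses made-up HTML to show that effect, plus a less index-dependent alternative: read every span inside the "rte text--pull" block from the question and split its text on the first colon. The sample markup, the values in it, and the helper name extract_specs are all hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, assumed for illustration; the real page may differ.
SAMPLE_HTML = """
<div class="rte text--pull">
  <span>Model Number: C914DS</span>
  <span><strong>Model Name:</strong> Rustic Gent</span>
  <span>Overall Length: 7.97" / 202.5mm</span>
</div>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# string= (the newer name for text=) is matched against tag.string, which is
# None when a tag holds more than one child node, so this span is missed.
missed = page.find_all("span", string=re.compile(r"Model Name"))


def extract_specs(container):
    """Return {label: value} for every 'Label: value' span in container."""
    specs = {}
    for span in container.find_all("span"):
        # get_text() flattens nested tags, so <strong> inside a span is fine
        label, sep, value = span.get_text(" ", strip=True).partition(":")
        if sep and value.strip():
            specs[label.strip()] = value.strip()
    return specs


specs = extract_specs(page.find("div", class_="rte text--pull"))
```

With a dict like this, a lookup such as specs.get("Overall Length") no longer depends on how many times the label appears elsewhere on the page.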
If you want to add other attributes, copy one of the existing entries and change the label text, for example:
str(Booti2(text=re.compile(r'(?:Blade Length: )'))[4]).replace('Blade Length: ', '') # the order of attributes here must match the order of the column names
and then add the matching column name:
columns=['Name', 'Price', 'Model Number', 'Model Name', 'Overall Length', 'new column name here']
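As a possible alternative to keeping a positional list and a column list in sync, each product can be stored as a dict and the DataFrame built once at the end. pandas then takes the column names from the dict keys and fills in NaN where a product lacks an attribute. This is only a sketch: the records below are invented placeholders standing in for whatever the scraping loop collects.

```python
import pandas as pd

# Hypothetical records; in the real script one dict would be appended
# per product page inside the scraping loop.
records = [
    {"Name": "Knife A", "Price": "$59.50", "Model Number": "C0001",
     "Overall Length": '7.9" / 200mm'},
    {"Name": "Knife B", "Price": "$65.00", "Model Number": "C0002",
     "Blade Length": '3.0" / 76mm'},
]

# Columns are the union of all keys; missing values become NaN.
final = pd.DataFrame(records)
print(final)
```

Adding a new attribute then only means adding one more key to the dict, with no separate column list to update.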
