So I am scraping this product website that has multiple pages, and every page has tables with similar column but different column values. heres an example: https://www.benchmade.com/317-1-weekender.html and like wise heres another one: https://www.benchmade.com/15600or-raghorn.html and there are about 144 links like this.
What I want is a table where I could have all the similar columns grouped into one and be the column headers and the rows being the column values.
So something like this that could be outputted as a csv table:
Blade Length. | | Blade Thickness|| Open Length |--etc etc
|------------- | |----------------||-------------|
| 2.97/1.97" | | 4.34/12.54 || 1.23/5.65 |
| 4.24/2.23" | | 2.34/5.63 || 5.43/2.90 |
| 3.54/2.65 | | 2.57/6.54 || 6.90/4.20 |
| 7.65/5/43 | | 4.65/3.56 || 3.32/4.54 |
I have done this so far:
product_links = []
for x in range (1,4):
HTML = requests.get(f'https://www.benchmade.com/all-products.html?blade_edge=521,531,2231&p={x}&price=75-2400&product_list_limit=48',HEADER)
#HTML.status_code
Booti= soup(HTML.content, "lxml")
knife_items = Booti.find_all('li',class_= "item product product-item")
for items in knife_items:
for links in items.findAll('a', class_= "product photo product-item-photo", href = True):
product_links.append(links['href'])
for links_2 in product_links:
#testlinks = "https://www.benchmade.com/4010-211-collectors-edition-station-knife.html"
Specifications_data = pd.read_html(links_2)[0]
Any help would be appreciated!!! Thank you so much!
CodePudding user response:
Quite easy to do with pandas.
import pandas as pd
urls = ['https://www.benchmade.com/317-1-weekender.html',
'https://www.benchmade.com/15600or-raghorn.html']
final_df = pd.DataFrame()
for url in urls:
df = pd.read_html(url)[0].set_index(0).T
final_df = final_df.append(df, sort=False).reset_index(drop=True)
Output:
print(final_df)
0 Blade Length: Blade Thickness: ... Weight: Sheath Weight:
0 2.97/1.97" | 7.16/5.00cm 0.090" | 2.286mm ... 2.28oz | 64.64g NaN
1 4.64" | 11.78 cm 0.09" | 2.286mm ... COMING SOON 21.26g
CodePudding user response:
Let us first modify your code a bit to save the results in a list res:
product_links = []
res = []
for x in range (1,4):
... # continue your code
Specifications_data = pd.read_html(links_2)[0]
res.append(Specifications_data)
Now we put the data from product_links and res together in a dataframe; many ways to do it, eg like this (we use urls from product_links as index so you know which data corresponds to which knife)
res_dict = {k:dict(zip(v[0],v[1])) for k,v in zip(product_links, res)}
df = pd.DataFrame.from_dict(res_dict, orient='index')
You get a large df; df.head() now looks like this:
Blade Length: Blade Thickness: Open Length: Handle Thickness: Weight: Sheath Weight: Closed Length: Blade Edge Blade Finish/Color Blade Steel Blade Style/Shape Clip Type Clip Position Handle Material Lanyard Hole MOLLE Compatible Use Blade Style Product Box: Designer: Mechanism: Action: Blade Steel: Overall Length: Drop-point Blade Style with Valox Handle Tanto Blade Style with Valox Handle Drop-point Blade Style with G10/Aluminum Handle Drop-point Blade Style with G10 Handle Valox Handle G10/Aluminum Handle Drop-point Blade Style with G10 Drop-point Blade Style Tanto Blade Style Green and red contoured G10 handle Sand contoured G10 handle Handle Length: Opposing Bevel Blade Style Sheepsfoot Blade Style Aluminum Handles Carbon Fiber Handles G10 Handles Glass Breaker Sheath Type
------------------------------------------------------------------------ --------------- ------------------ ---------------- ------------------- ---------------- ---------------- ---------------- ------------ -------------------- ------------- ------------------- ----------- --------------- ----------------- -------------- ------------------ ----- ------------- -------------- ----------- ------------ --------- -------------- ----------------- ------------------------------------------ ------------------------------------- ------------------------------------------------- ---------------------------------------- -------------- --------------------- --------------------------------- ------------------------ ------------------- ------------------------------------ --------------------------- ---------------- ---------------------------- ------------------------ ------------------ ---------------------- ------------- --------------- -------------
https://www.benchmade.com/4010-211-collectors-edition-station-knife.html 5.97" | 15.16cm 0.114" | 2.896mm 10.88" | 27.64cm 0.61" | 15.44mm 6.92oz | 196.18g 1.27oz | 36.00g nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
https://www.benchmade.com/4000-211-collectors-edition-3-piece-set.html 8.04" | 20.42cm 0.114" | 2.896mm 13.02" | 33.07cm 0.61" | 15.44mm 7.37oz | 208.94g nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
https://www.benchmade.com/602-211-tengu-tool.html 1.14" | 2.90cm 0.124" | 3.150mm 3.27" | 8.31cmm 0.40" | 10.16mm 1.04oz | 29.48g 0.28oz | 7.94g 2.14" | 5.44cm nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
https://www.benchmade.com/9070bk-1-claymore.html 3.60" | 8.64cm 0.114" | 2.896mm 8.60" | 19.81cm 0.60" | 14.99mm 3.50oz | 97.24g nan 5.00" | 11.18cm nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
https://www.benchmade.com/9070bk-claymore.html 3.60" | 8.64cm 0.114" | 2.896mm 8.60" | 19.81cm 0.60" | 14.99mm 3.50oz | 97.24g nan 5.00" | 11.18cm nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
you can further massage the dataframe to get the columns you really need eg
df[['Blade Length:', 'Blade Thickness:', 'Open Length:',
'Handle Thickness:', 'Weight:']]
and so on
