Parsing multiple Tables from multiple URLs with similar column names-CodePudding

So I am scraping this product website that has multiple pages, and every page has tables with similar column but different column values. heres an example: https://www.benchmade.com/317-1-weekender.html and like wise heres another one: https://www.benchmade.com/15600or-raghorn.html and there are about 144 links like this.

What I want is a table where I could have all the similar columns grouped into one and be the column headers and the rows being the column values.

So something like this that could be outputted as a csv table:

 Blade Length. | | Blade Thickness|| Open Length |--etc etc
|------------- | |----------------||-------------| 
| 2.97/1.97"   | |  4.34/12.54    ||  1.23/5.65  |  
| 4.24/2.23"   | |  2.34/5.63     ||  5.43/2.90  |  
| 3.54/2.65    | |  2.57/6.54     ||  6.90/4.20  |  
| 7.65/5/43    | |  4.65/3.56     ||  3.32/4.54  |

I have done this so far:

product_links = []
for x in range (1,4):
    HTML = requests.get(f'https://www.benchmade.com/all-products.html?blade_edge=521,531,2231&p={x}&price=75-2400&product_list_limit=48',HEADER)
    #HTML.status_code
    Booti= soup(HTML.content, "lxml")
    knife_items = Booti.find_all('li',class_= "item product product-item")
    
    for items in knife_items:
        for links in items.findAll('a', class_= "product photo product-item-photo", href = True):
            product_links.append(links['href'])
for links_2 in product_links:
#testlinks = "https://www.benchmade.com/4010-211-collectors-edition-station-knife.html"
    Specifications_data = pd.read_html(links_2)[0]

Any help would be appreciated!!! Thank you so much!

CodePudding user response：

Quite easy to do with pandas.

import pandas as pd

urls = ['https://www.benchmade.com/317-1-weekender.html',
       'https://www.benchmade.com/15600or-raghorn.html']

final_df = pd.DataFrame()
for url in urls:
    df = pd.read_html(url)[0].set_index(0).T
    final_df = final_df.append(df, sort=False).reset_index(drop=True)

Output:

print(final_df)
0             Blade Length:  Blade Thickness:  ...          Weight: Sheath Weight:
0  2.97/1.97" | 7.16/5.00cm  0.090" | 2.286mm  ...  2.28oz | 64.64g            NaN
1          4.64" | 11.78 cm   0.09" | 2.286mm  ...      COMING SOON         21.26g

CodePudding user response：

Let us first modify your code a bit to save the results in a list res:

product_links = []
res = []
for x in range (1,4):
... # continue your code
    Specifications_data = pd.read_html(links_2)[0]
    res.append(Specifications_data)

Now we put the data from product_links and res together in a dataframe; many ways to do it, eg like this (we use urls from product_links as index so you know which data corresponds to which knife)

res_dict = {k:dict(zip(v[0],v[1])) for k,v in zip(product_links, res)}
df = pd.DataFrame.from_dict(res_dict, orient='index')

You get a large df; df.head() now looks like this:

                                                                          Blade Length:    Blade Thickness:    Open Length:      Handle Thickness:    Weight:           Sheath Weight:    Closed Length:      Blade Edge    Blade Finish/Color    Blade Steel    Blade Style/Shape    Clip Type    Clip Position    Handle Material    Lanyard Hole    MOLLE Compatible    Use    Blade Style    Product Box:    Designer:    Mechanism:    Action:    Blade Steel:    Overall Length:    Drop-point Blade Style with Valox Handle    Tanto Blade Style with Valox Handle    Drop-point Blade Style with G10/Aluminum Handle    Drop-point Blade Style with G10 Handle    Valox Handle    G10/Aluminum Handle    Drop-point Blade Style with G10    Drop-point Blade Style    Tanto Blade Style    Green and red contoured G10 handle    Sand contoured G10 handle    Handle Length:    Opposing Bevel Blade Style    Sheepsfoot Blade Style    Aluminum Handles    Carbon Fiber Handles    G10 Handles    Glass Breaker    Sheath Type
------------------------------------------------------------------------  ---------------  ------------------  ----------------  -------------------  ----------------  ----------------  ----------------  ------------  --------------------  -------------  -------------------  -----------  ---------------  -----------------  --------------  ------------------  -----  -------------  --------------  -----------  ------------  ---------  --------------  -----------------  ------------------------------------------  -------------------------------------  -------------------------------------------------  ----------------------------------------  --------------  ---------------------  ---------------------------------  ------------------------  -------------------  ------------------------------------  ---------------------------  ----------------  ----------------------------  ------------------------  ------------------  ----------------------  -------------  ---------------  -------------
https://www.benchmade.com/4010-211-collectors-edition-station-knife.html  5.97" | 15.16cm  0.114" | 2.896mm    10.88" | 27.64cm  0.61" | 15.44mm      6.92oz | 196.18g  1.27oz | 36.00g   nan                        nan                   nan            nan                  nan          nan              nan                nan             nan                 nan    nan            nan             nan          nan           nan        nan             nan                nan                                         nan                                    nan                                                nan                                       nan             nan                    nan                                nan                       nan                  nan                                   nan                          nan               nan                           nan                       nan                 nan                     nan            nan              nan            nan
https://www.benchmade.com/4000-211-collectors-edition-3-piece-set.html    8.04" | 20.42cm  0.114" | 2.896mm    13.02" | 33.07cm  0.61" | 15.44mm      7.37oz | 208.94g  nan               nan                        nan                   nan            nan                  nan          nan              nan                nan             nan                 nan    nan            nan             nan          nan           nan        nan             nan                nan                                         nan                                    nan                                                nan                                       nan             nan                    nan                                nan                       nan                  nan                                   nan                          nan               nan                           nan                       nan                 nan                     nan            nan              nan            nan
https://www.benchmade.com/602-211-tengu-tool.html                         1.14" | 2.90cm   0.124" | 3.150mm    3.27" | 8.31cmm   0.40" | 10.16mm      1.04oz | 29.48g   0.28oz | 7.94g    2.14" | 5.44cm             nan                   nan            nan                  nan          nan              nan                nan             nan                 nan    nan            nan             nan          nan           nan        nan             nan                nan                                         nan                                    nan                                                nan                                       nan             nan                    nan                                nan                       nan                  nan                                   nan                          nan               nan                           nan                       nan                 nan                     nan            nan              nan            nan
https://www.benchmade.com/9070bk-1-claymore.html                          3.60" | 8.64cm   0.114" | 2.896mm    8.60" | 19.81cm   0.60" | 14.99mm      3.50oz | 97.24g   nan               5.00" | 11.18cm            nan                   nan            nan                  nan          nan              nan                nan             nan                 nan    nan            nan             nan          nan           nan        nan             nan                nan                                         nan                                    nan                                                nan                                       nan             nan                    nan                                nan                       nan                  nan                                   nan                          nan               nan                           nan                       nan                 nan                     nan            nan              nan            nan
https://www.benchmade.com/9070bk-claymore.html                            3.60" | 8.64cm   0.114" | 2.896mm    8.60" | 19.81cm   0.60" | 14.99mm      3.50oz | 97.24g   nan               5.00" | 11.18cm            nan                   nan            nan                  nan          nan              nan                nan             nan                 nan    nan            nan             nan          nan           nan        nan             nan                nan                                         nan                                    nan                                                nan                                       nan             nan                    nan                                nan                       nan                  nan                                   nan                          nan               nan                           nan                       nan                 nan                     nan            nan              nan            nan

you can further massage the dataframe to get the columns you really need eg

df[['Blade Length:', 'Blade Thickness:', 'Open Length:',
       'Handle Thickness:', 'Weight:']]

and so on