Home > database >  Creating dictionary from two lists of webscraped data
Creating dictionary from two lists of webscraped data

Time:01-27

so I have this webdata that I scraped from a product website. I scraped it using BeautifulSoup, and scraped multiple pages from the product website. I get two lists from the scraper, one is of the specification and the other is the data for specification. Here is an example:

Blade length : 2.97/1.97" 
Blade Thickness : 0.090/2.54" 
Open Length :   7.05/6.05" 
Closed Length:  4.08" | 9.78cm
Handle Thickness:   0.40" | 10.16mm
Weight: 2.28oz | 64.64g

I want to get the left hand side to be a Key for dictionary and the right hand side to be the value. The ultimate goal is to put it in a csv where I can have the left hand side to be the column headers for the data in the right hand side. Since I am scraping multiple pages, the left hand side repeats itself and the there are multiple values of the right hand side.

So the desired output should be something like this:

 Blade Length. | | Blade Thickness|| Open Length |--etc etc
|------------- | |----------------||-------------| 
| 2.97/1.97"   | |  4.34/12.54    ||  1.23/5.65  |  
| 4.24/2.23"   | |  2.34/5.63     ||  5.43/2.90  |  
| 3.54/2.65    | |  2.57/6.54     ||  6.90/4.20  |  
| 7.65/5/43    | |  4.65/3.56     ||  3.32/4.54  |  

so if there is a better way to do this than dictionaries then please let me know!

The HTML is something like this:

<table >
     <tbody>
               <tr>
                    <th >Blade Length:</th>
                    <td >2.97/1.97"</td>
               </tr>
               <tr>
                    <th >Blade Thickness:</th>
                    <td >0.090"</td>
               </tr>
               <tr>
                    <th >Open Length: </th>
                    <td >7.05/6.05"</td>
               </tr>
               <tr>
                    <th >Closed Length: </th>
                    <td >4.08"</td>
               </tr>
               <tr>
                    <th >Handle Thickness:</th>
                    <td >0.40" </td>
               </tr>
               <tr>
                    <th >Weight:</th>
                    <td >2.28oz</td>
               </tr>
      </tbody>
</table>

Here is my attempt to get this data:

Specs = []
Specs_Datas = defaultdict(list)
Specs2 = []
for links in product_links:
    HTML2 = requests.get(links, HEADER)
    Booti2 = soup(HTML2.content,"html.parser") 
    table_feature = Booti2.select_one('#product-attribute-specs-table')
    #find all rows
    try:
        for S in Booti2.find_all('th', attrs ={'class': 'col label'}):
            Specs.append(S.text.replace('\n', '').strip())
            unique_specs = np.unique(Specs).tolist()
            while unique_specs in Specs:
                for SD in Booti2.find_all('td', attrs ={'class': 'col value'}):
                    Specs2.append(SD.text.replace('\n', '').strip())
                    Specs_Datas[unique_specs] = []
                    Specs_Datas[unique_specs].update(SD.text.replace('\n', '').strip())
            #Specs.append(S.text.replace('\n', '').strip())
            
    except:
        continue

Any help would be appreciated!! Thank you so much!!!

CodePudding user response:

You can let pandas parse the <table> tag and transpose it. Then you append the transposed tables into a result_df.

It's tough to show with the limited code you have here, but this is the basic idea:

import pandas as pd


html = '''<table >
     <tbody>
               <tr>
                    <th >Blade Length:</th>
                    <td >2.97/1.97"</td>
               </tr>
               <tr>
                    <th >Blade Thickness:</th>
                    <td >0.090"</td>
               </tr>
               <tr>
                    <th >Open Length: </th>
                    <td >7.05/6.05"</td>
               </tr>
               <tr>
                    <th >Closed Length: </th>
                    <td >4.08"</td>
               </tr>
               <tr>
                    <th >Handle Thickness:</th>
                    <td >0.40" </td>
               </tr>
               <tr>
                    <th >Weight:</th>
                    <td >2.28oz</td>
               </tr>
      </tbody>
</table>'''

df = pd.read_html(html)[0].set_index(0).T

Output:

print(df)
0 Blade Length: Blade Thickness:  ... Handle Thickness: Weight:
1    2.97/1.97"           0.090"  ...             0.40"  2.28oz

CodePudding user response:

Assuming you actually have two lists as:

specs = [
    "Blade length",
    "Blade Thickness",
    "Open Length",
    "Closed Length",
    "Handle Thickness",
    "Weight"
]

and

spec_data = [
    "2.97/1.97\"",
    "0.090/2.54\"",
    "7.05/6.05\"",
    "4.08\" | 9.78cm",
    "0.40\" | 10.16mm",
    "2.28oz | 64.64g"
]

The easiest path might be to zip() them together:

specs_reshaped = [
    {key: value for key, value in zip(specs, spec_data)}
]

Then use a DictWriter()

with open("output.csv", "w", newline="", encoding="utf-8") as file_out:
    writer = csv.DictWriter(file_out, fieldnames=specs)
    writer.writeheader()
    writer.writerows(specs_reshaped)

This produces a file like:

Blade length,Blade Thickness,Open Length,Closed Length,Handle Thickness,Weight
"2.97/1.97""","0.090/2.54""","7.05/6.05""","4.08"" | 9.78cm","0.40"" | 10.16mm",2.28oz | 64.64g
  •  Tags:  
  • Related