so I have this webdata that I scraped from a product website. I scraped it using BeautifulSoup, and scraped multiple pages from the product website. I get two lists from the scraper, one is of the specification and the other is the data for specification. Here is an example:
Blade length : 2.97/1.97"
Blade Thickness : 0.090/2.54"
Open Length : 7.05/6.05"
Closed Length: 4.08" | 9.78cm
Handle Thickness: 0.40" | 10.16mm
Weight: 2.28oz | 64.64g
I want to get the left hand side to be a Key for dictionary and the right hand side to be the value. The ultimate goal is to put it in a csv where I can have the left hand side to be the column headers for the data in the right hand side. Since I am scraping multiple pages, the left hand side repeats itself and the there are multiple values of the right hand side.
So the desired output should be something like this:
Blade Length. | | Blade Thickness|| Open Length |--etc etc
|------------- | |----------------||-------------|
| 2.97/1.97" | | 4.34/12.54 || 1.23/5.65 |
| 4.24/2.23" | | 2.34/5.63 || 5.43/2.90 |
| 3.54/2.65 | | 2.57/6.54 || 6.90/4.20 |
| 7.65/5/43 | | 4.65/3.56 || 3.32/4.54 |
so if there is a better way to do this than dictionaries then please let me know!
The HTML is something like this:
<table >
<tbody>
<tr>
<th >Blade Length:</th>
<td >2.97/1.97"</td>
</tr>
<tr>
<th >Blade Thickness:</th>
<td >0.090"</td>
</tr>
<tr>
<th >Open Length: </th>
<td >7.05/6.05"</td>
</tr>
<tr>
<th >Closed Length: </th>
<td >4.08"</td>
</tr>
<tr>
<th >Handle Thickness:</th>
<td >0.40" </td>
</tr>
<tr>
<th >Weight:</th>
<td >2.28oz</td>
</tr>
</tbody>
</table>
Here is my attempt to get this data:
Specs = []
Specs_Datas = defaultdict(list)
Specs2 = []
for links in product_links:
HTML2 = requests.get(links, HEADER)
Booti2 = soup(HTML2.content,"html.parser")
table_feature = Booti2.select_one('#product-attribute-specs-table')
#find all rows
try:
for S in Booti2.find_all('th', attrs ={'class': 'col label'}):
Specs.append(S.text.replace('\n', '').strip())
unique_specs = np.unique(Specs).tolist()
while unique_specs in Specs:
for SD in Booti2.find_all('td', attrs ={'class': 'col value'}):
Specs2.append(SD.text.replace('\n', '').strip())
Specs_Datas[unique_specs] = []
Specs_Datas[unique_specs].update(SD.text.replace('\n', '').strip())
#Specs.append(S.text.replace('\n', '').strip())
except:
continue
Any help would be appreciated!! Thank you so much!!!
CodePudding user response:
You can let pandas parse the <table> tag and transpose it. Then you append the transposed tables into a result_df.
It's tough to show with the limited code you have here, but this is the basic idea:
import pandas as pd
html = '''<table >
<tbody>
<tr>
<th >Blade Length:</th>
<td >2.97/1.97"</td>
</tr>
<tr>
<th >Blade Thickness:</th>
<td >0.090"</td>
</tr>
<tr>
<th >Open Length: </th>
<td >7.05/6.05"</td>
</tr>
<tr>
<th >Closed Length: </th>
<td >4.08"</td>
</tr>
<tr>
<th >Handle Thickness:</th>
<td >0.40" </td>
</tr>
<tr>
<th >Weight:</th>
<td >2.28oz</td>
</tr>
</tbody>
</table>'''
df = pd.read_html(html)[0].set_index(0).T
Output:
print(df)
0 Blade Length: Blade Thickness: ... Handle Thickness: Weight:
1 2.97/1.97" 0.090" ... 0.40" 2.28oz
CodePudding user response:
Assuming you actually have two lists as:
specs = [
"Blade length",
"Blade Thickness",
"Open Length",
"Closed Length",
"Handle Thickness",
"Weight"
]
and
spec_data = [
"2.97/1.97\"",
"0.090/2.54\"",
"7.05/6.05\"",
"4.08\" | 9.78cm",
"0.40\" | 10.16mm",
"2.28oz | 64.64g"
]
The easiest path might be to zip() them together:
specs_reshaped = [
{key: value for key, value in zip(specs, spec_data)}
]
Then use a DictWriter()
with open("output.csv", "w", newline="", encoding="utf-8") as file_out:
writer = csv.DictWriter(file_out, fieldnames=specs)
writer.writeheader()
writer.writerows(specs_reshaped)
This produces a file like:
Blade length,Blade Thickness,Open Length,Closed Length,Handle Thickness,Weight
"2.97/1.97""","0.090/2.54""","7.05/6.05""","4.08"" | 9.78cm","0.40"" | 10.16mm",2.28oz | 64.64g
