| 4G bands (A) | 4G bands (B) | 5G bands (A) | 5G bands (B) | 5G bands (C) |
|---|---|---|---|---|
| 1, 2, 3 - A2643 | 1, 2, 3 - A2484 | 1, 2, 3 - A2643 | 1, 2, 3 - A2484 | 1, 2, 3 - A2641 |
How do I get the above-mentioned output from the following HTML table?
<table cellspacing="0">
<tr >
<td ><a href="network-bands.php3">4G bands</a></td>
<td data-spec="net4g">1, 2, 3 - A2643</td>
</tr>
<tr data-spec-optional>
<td > </td>
<td >1, 2, 3 - A2484</td>
</tr>
<tr >
<td ><a href="network-bands.php3">5G bands</a></td>
<td data-spec="net5g">1, 2, 3 - A2643</td>
</tr>
<tr data-spec-optional>
<td > </td>
<td >1, 2, 3 - A2484</td>
</tr>
<tr data-spec-optional>
<td > </td>
<td >1, 2, 3 - A2641</td>
</tr>
</table>
My problem is these rows: <tr data-spec-optional>. They are siblings of the titled rows, so nothing in the markup ties them to the title they belong to.
I have developed a spider that successfully crawls the entire structure of an online portal and collects the necessary information. Only these optional rows cause me problems.
The approach from "How to select and extract texts between two elements?" gave me hope, but I could not get it to work.
Any help would be appreciated!
CodePudding user response:
Honestly, I'm sure it's not the best way, but it works. If I think of another way, I'll edit the answer.
In [1]: html="""<html>
...: <body>
...: <table cellspacing="0">
...:
...: <tr >
...: <td ><a href="network-bands.php3">4G bands</a></td>
...: <td data-spec="net4g">1, 2, 3 - A2643</td>
...: </tr>
...:
...: <tr data-spec-optional>
...: <td > </td>
...: <td >1, 2, 3 - A2484</td>
...: </tr>
...:
...: <tr >
...: <td ><a href="network-bands.php3">5G bands</a></td>
...: <td data-spec="net5g">1, 2, 3 - A2643</td>
...: </tr>
...:
...: <tr data-spec-optional>
...: <td > </td>
...: <td >1, 2, 3 - A2484</td>
...: </tr>
...:
...: <tr data-spec-optional>
...: <td > </td>
...: <td >1, 2, 3 - A2641</td>
...: </tr>
...:
...: </table>
...: </body>
...: </html>"""
In [2]: from scrapy import Selector
In [3]: response = Selector(text=html)
In [4]: for row in response.xpath('//tr'):
...:     # A row with a link starts a new group; other rows continue it
...:     if row.xpath('.//a'):
...:         ch = 'A'
...:         title1 = row.xpath('.//a/text()').get()
...:     else:
...:         ch = chr(ord(ch) + 1)
...:     title = title1 + f' ({ch})'
...:     # The value is in the second cell of each row
...:     data = row.xpath('./td[2]/text()').get()
...:     print(f'"{title}" : "{data}"')
...:
"4G bands (A)" : "1, 2, 3 - A2643"
"4G bands (B)" : "1, 2, 3 - A2484"
"5G bands (A)" : "1, 2, 3 - A2643"
"5G bands (B)" : "1, 2, 3 - A2484"
"5G bands (C)" : "1, 2, 3 - A2641"
CodePudding user response:
This answer is inspired by @SuperUser's answer, which uses the ordinal value to increment the letter. You can obtain the headers and values with XPath and then combine them into the final dictionary.
import re

# The XPaths select rows and cells by class (tr-toggle, ttl, nfo)
titles = response.xpath("//tr[@class='tr-toggle']/td[@class='ttl']/descendant-or-self::*/text()").getall()
for i, title in enumerate(titles):
    if title.strip() == '':
        # Optional row: reuse the previous title with the next letter
        prev_letter = re.search(r"\((.)\)$", titles[i-1]).group(1)
        new_title = f"{titles[i-1][:-4]} ({chr(ord(prev_letter) + 1)})"
        titles[i] = new_title
    else:
        # Titled row: start a new group at (A)
        titles[i] = f"{title} (A)"
values = response.xpath("//tr[@class='tr-toggle']/td[@class='nfo']/text()").getall()
results = {}
for title, value in zip(titles, values):
    results[title] = value
Printing the results dictionary gives:
{'4G bands (A)': '1, 2, 3 - A2643',
'4G bands (B)': '1, 2, 3 - A2484',
'5G bands (A)': '1, 2, 3 - A2643',
'5G bands (B)': '1, 2, 3 - A2484',
'5G bands (C)': '1, 2, 3 - A2641'}
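Both answers come down to the same two steps: carry the last seen title forward with an incrementing letter, then lay the result out as a single row. Here is a minimal sketch of those steps without Scrapy, assuming the cells were already extracted as (title, value) pairs with a blank title for the optional rows. `label_rows` and `to_table` are hypothetical helper names, and a group is limited to 26 rows (A to Z).

```python
from string import ascii_uppercase


def label_rows(rows):
    """Label (title, value) pairs; a blank title continues the previous group.

    Hypothetical helper: the input shape is an assumption, not a Scrapy API.
    """
    labelled = {}
    current_title = None
    index = 0
    for title, value in rows:
        title = title.strip()
        if title:
            # A row with its own title starts a new group at (A).
            current_title = title
            index = 0
        else:
            # An optional row inherits the last title and takes the next letter.
            index += 1
        if current_title is None:
            raise ValueError("first row has no title")
        labelled[f"{current_title} ({ascii_uppercase[index]})"] = value
    return labelled


def to_table(results):
    """Render a flat dict as a one-row pipe table (hypothetical helper)."""
    header = "| " + " | ".join(results.keys()) + " |"
    divider = "|" + "---|" * len(results)
    row = "| " + " | ".join(results.values()) + " |"
    return "\n".join([header, divider, row])


rows = [
    ("4G bands", "1, 2, 3 - A2643"),
    ("\xa0", "1, 2, 3 - A2484"),  # non-breaking space counts as blank
    ("5G bands", "1, 2, 3 - A2643"),
    ("", "1, 2, 3 - A2484"),
    ("", "1, 2, 3 - A2641"),
]
labelled = label_rows(rows)
table = to_table(labelled)
print(table)
```

In a spider, the dictionary returned by `label_rows` could probably be yielded as the item directly; the table rendering is only for display.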
