| 4G bands (A) | 4G bands (B) | 5G bands (A) | 5G bands (B) | 5G bands (C) |
|---|---|---|---|---|
| 1, 2, 3 - A2643 | 1, 2, 3 - A2484 | 1, 2, 3 - A2643 | 1, 2, 3 - A2484 | 1, 2, 3 - A2641 |
How do I get a structured output as above from a table like this?
<table cellspacing="0">
<tr >
<th rowspan="15" scope="row">Network</th>
<td ><a href="network-bands.php3">Technology</a></td>
<td ><a href="#" data-spec="nettech">GSM / CDMA / HSPA / EVDO / LTE / 5G</a></td>
</tr>
<tr >
<td ><a href="network-bands.php3">2G bands</a></td>
<td data-spec="net2g">GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2 (dual-SIM)</td>
</tr>
<tr data-spec-optional>
<td > </td>
<td >CDMA 800 / 1900 </td>
</tr>
<tr >
<td ><a href="network-bands.php3">3G bands</a></td>
<td data-spec="net3g">HSDPA 850 / 900 / 1700(AWS) / 1900 / 2100 </td>
</tr>
<tr data-spec-optional>
<td > </td>
<td >CDMA2000 1xEV-DO </td>
</tr>
<tr >
<td ><a href="network-bands.php3">4G bands</a></td>
<td data-spec="net4g">1, 2, 3, 4, 5, 7, 8, 12, 13, 17, 18, 19, 20, 25, 26, 28, 30, 32, 34, 38, 39, 40, 41, 42, 46, 48, 66 - A2643, A2644, A2645</td>
</tr>
<tr data-spec-optional>
<td > </td>
<td >1, 2, 3, 4, 5, 7, 8, 11, 12, 13, 14, 17, 18, 19, 20, 21, 25, 26, 28, 29, 30, 32, 34, 38, 39, 40, 41, 42, 46, 48, 66, 71 - A2484, A2641</td>
</tr>
<tr >
<td ><a href="network-bands.php3">5G bands</a></td>
<td data-spec="net5g">1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 30, 38, 40, 41, 48, 66, 77, 78, 79 SA/NSA/Sub6 - A2643, A2644</td>
</tr>
<tr data-spec-optional>
<td > </td>
<td >1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 29, 30, 38, 40, 41, 48, 66, 71, 78, 79, 258, 260, 261 SA/NSA/Sub6/mmWave - A2484</td>
</tr>
<tr data-spec-optional>
<td > </td>
<td >1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 29, 30, 38, 40, 41, 48, 66, 71, 77, 78, 79 SA/NSA/Sub6 - A2641</td>
</tr>
<tr >
<td ><a href="glossary.php3?term=3g">Speed</a></td>
<td data-spec="speed">HSPA 42.2/5.76 Mbps, LTE-A, 5G, EV-DO Rev.A 3.1 Mbps</td>
</tr>
</table>
My problem is dealing with these rows: <tr data-spec-optional>. They are not bound to the preceding heading row by any hierarchy.
This approach gave me hope, but I was not able to implement it successfully.
Any help would be appreciated!
Edit
I need to implement the solution at the very bottom level, [LEVEL3]. This is the structure of my spider, followed by the code:
[Page LEVEL1] Brands URL https://www.gsmarena.com/makers.php3
[Page LEVEL2] All Devices URL https://www.gsmarena.com/apple-phones-48.php
[Page LEVEL3] Detail Page URL https://www.gsmarena.com/apple_iphone_13_pro_max-11089.php
import scrapy
from scrapy import Selector
from gsm.items import GsmItem


class GsmSpider(scrapy.Spider):
    name = 'gsm'
    allowed_domains = ['gsmarena.com']
    start_urls = ['https://gsmarena.com/makers.php3']

    # LEVEL1 | all brands
    def parse(self, response):
        item = GsmItem()
        gsms = response.xpath('//div[@]/table//tr[3]//td[2]')
        for gsm in gsms:
            allbranddevicesurl = gsm.xpath('.//a/@href').get()
            brandname = gsm.xpath('.//a/text()').get()
            devicecount = gsm.xpath('.//span/text()').get()
            item['brandname'] = brandname
            item['devicecount'] = devicecount
            yield response.follow(allbranddevicesurl, callback=self.parse_allbranddevicesurl,
                                  meta={'brandname': item,
                                        'devicecount': item})

    # LEVEL2 | all devices
    def parse_allbranddevicesurl(self, response):
        item = response.meta['brandname']
        item = response.meta['devicecount']
        phones = response.xpath('//*[@id="review-body"]//li')
        for phone in phones:
            detailpageurl = phone.xpath('.//a/@href').get()
            item['detailpageurl'] = detailpageurl
            yield response.follow(detailpageurl,
                                  callback=self.parse_detailpage,
                                  meta={'brandname': item,
                                        'devicecount': item,
                                        'detailpageurl': item})
        next_page = response.xpath('//a[@]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_allbranddevicesurl,
                                  meta={'brandname': item,
                                        'devicecount': item,
                                        'detailpageurl': item})

    # LEVEL3 | Detailpage
    def parse_detailpage(self, response):
        item = response.meta['brandname']
        item = response.meta['devicecount']
        item = response.meta['detailpageurl']
        for row in response.xpath('//tr[@class="tr-toggle"]'):
            if row.xpath('.//a'):
                # Row with a linked heading: start a new group at (A)
                ch = 'A'
                title1 = row.xpath('.//a/text()').get()
            else:
                # Continuation row: keep the heading, advance the letter
                ch = chr(ord(ch) + 1)
            title = title1 + f' ({ch})'
            data = row.xpath('.//td[@class="nfo"]/text()').get()
            item['title'] = title
            item['data'] = data
            yield item
CodePudding user response:
Honestly, I'm sure this isn't the best way, but it works. If I think of another way, I'll edit the answer.
In [1]: html="""<html>
   ...: <body>
   ...: <table cellspacing="0">
   ...:
   ...: <tr class="tr-toggle">
   ...: <td class="ttl"><a href="network-bands.php3">4G bands</a></td>
   ...: <td class="nfo" data-spec="net4g">1, 2, 3 - A2643</td>
   ...: </tr>
   ...:
   ...: <tr class="tr-toggle" data-spec-optional>
   ...: <td class="ttl"> </td>
   ...: <td class="nfo">1, 2, 3 - A2484</td>
   ...: </tr>
   ...:
   ...: <tr class="tr-toggle">
   ...: <td class="ttl"><a href="network-bands.php3">5G bands</a></td>
   ...: <td class="nfo" data-spec="net5g">1, 2, 3 - A2643</td>
   ...: </tr>
   ...:
   ...: <tr class="tr-toggle" data-spec-optional>
   ...: <td class="ttl"> </td>
   ...: <td class="nfo">1, 2, 3 - A2484</td>
   ...: </tr>
   ...:
   ...: <tr class="tr-toggle" data-spec-optional>
   ...: <td class="ttl"> </td>
   ...: <td class="nfo">1, 2, 3 - A2641</td>
   ...: </tr>
   ...:
   ...: </table>
   ...: </body>
   ...: </html>"""
In [2]: from scrapy import Selector
In [3]: response = Selector(text=html)
In [4]: for row in response.xpath('//tr[@class="tr-toggle"]'):
   ...:     if row.xpath('.//a'):
   ...:         ch = 'A'
   ...:         title1 = row.xpath('.//a/text()').get()
   ...:     else:
   ...:         ch = chr(ord(ch) + 1)
   ...:     title = title1 + f' ({ch})'
   ...:     data = row.xpath('.//td[@class="nfo"]/text()').get()
   ...:     print(f'\"{title}\" : \"{data}\"')
   ...:
"4G bands (A)" : "1, 2, 3 - A2643"
"4G bands (B)" : "1, 2, 3 - A2484"
"5G bands (A)" : "1, 2, 3 - A2643"
"5G bands (B)" : "1, 2, 3 - A2484"
"5G bands (C)" : "1, 2, 3 - A2641"
CodePudding user response:
This answer is inspired by @SuperUser's answer, which uses the ordinal value of the letter to increment the suffix. You can obtain the headings and values using XPath and then combine them to form the final dictionary:
import re

# Heading cells; continuation rows contribute a blank string
titles = response.xpath("//tr[@class='tr-toggle']/td[@class='ttl']/descendant-or-self::*/text()").getall()
for i, title in enumerate(titles):
    if title.strip() == '':
        # Blank heading: reuse the previous title and advance its letter
        prev_letter = re.search(r"\((.)\)$", titles[i-1]).group(1)
        new_title = f"{titles[i-1][:-4]} ({chr(ord(prev_letter) + 1)})"
        titles[i] = new_title
    else:
        titles[i] = f"{title} (A)"

values = response.xpath("//tr[@class='tr-toggle']/td[@class='nfo']/text()").getall()
results = {}
for title, value in zip(titles, values):
    results[title] = value
Printing the results dictionary gives the following:
{'4G bands (A)': '1, 2, 3 - A2643',
'4G bands (B)': '1, 2, 3 - A2484',
'5G bands (A)': '1, 2, 3 - A2643',
'5G bands (B)': '1, 2, 3 - A2484',
'5G bands (C)': '1, 2, 3 - A2641'}
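Both answers rest on the same rule: a row that has its own heading starts again at (A), and a row without one keeps the previous heading and moves to the next letter. As a rough sketch, that rule can be separated from the XPath queries and tested on plain data. The helper names `label_rows` and `to_markdown_table` are hypothetical, and the input is assumed to be a list of (heading, value) pairs in document order, with `None` or a blank string as the heading of a continuation row.

```python
def label_rows(rows):
    """Suffix each heading with (A), (B), ... and return a dict.

    `rows` is an iterable of (heading, value) pairs in document order.
    A heading that is None or blank marks a continuation row.
    """
    labeled = {}
    heading, letter = None, 'A'
    for raw_heading, value in rows:
        if raw_heading and raw_heading.strip():
            # Row with its own heading: start a new group at (A).
            heading, letter = raw_heading.strip(), 'A'
        elif heading is None:
            # Continuation row before any heading: nothing to attach it to.
            continue
        else:
            # Continuation row: same heading, next letter.
            letter = chr(ord(letter) + 1)
        labeled[f'{heading} ({letter})'] = (value or '').strip()
    return labeled


def to_markdown_table(results):
    """Render a {heading: value} dict as a one-row Markdown table."""
    headings = list(results)
    # Escape pipes so a value cannot break the table layout.
    cells = [str(results[h]).replace('|', r'\|') for h in headings]
    header = '| ' + ' | '.join(headings) + ' |'
    divider = '|' + '---|' * len(headings)
    row = '| ' + ' | '.join(cells) + ' |'
    return '\n'.join([header, divider, row])


# Sample data mirroring the shortened table used in the answers.
rows = [
    ('4G bands', '1, 2, 3 - A2643'),
    (None, '1, 2, 3 - A2484'),
    ('5G bands', '1, 2, 3 - A2643'),
    ('\xa0', '1, 2, 3 - A2484'),   # a non-breaking space counts as blank
    (None, '1, 2, 3 - A2641'),
]
results = label_rows(rows)
print(to_markdown_table(results))
```

In a spider, the pairs could be built from the per-row heading and value queries shown in the answers, and the resulting dictionary yielded once per detail page instead of mutating one shared item per row. Like the answers, this sketch assumes no heading has more than 26 rows.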
