How can I scrape a table with sub-rows that do not belong to any hierarchy?

Time:01-27

4G bands (A)      4G bands (B)      5G bands (A)      5G bands (B)      5G bands (C)
1, 2, 3 - A2643   1, 2, 3 - A2484   1, 2, 3 - A2643   1, 2, 3 - A2484   1, 2, 3 - A2641

How do I get a structured output as above from a table like this?

<table cellspacing="0">

    <tr >
    <th rowspan="15" scope="row">Network</th>
    <td ><a href="network-bands.php3">Technology</a></td>
    <td ><a href="#"  data-spec="nettech">GSM / CDMA / HSPA / EVDO / LTE / 5G</a></td>
    </tr>

    <tr >
    <td ><a href="network-bands.php3">2G bands</a></td>
    <td  data-spec="net2g">GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2 (dual-SIM)</td>
    </tr>
    
    <tr  data-spec-optional>
    <td >&nbsp;</td>
    <td >CDMA 800 / 1900 </td>
    </tr>

    <tr >
    <td ><a href="network-bands.php3">3G bands</a></td>
    <td  data-spec="net3g">HSDPA 850 / 900 / 1700(AWS) / 1900 / 2100 </td>
    </tr>

    <tr  data-spec-optional>
    <td >&nbsp;</td>
    <td >CDMA2000 1xEV-DO </td>
    </tr>

    <tr >
    <td ><a href="network-bands.php3">4G bands</a></td>
    <td  data-spec="net4g">1, 2, 3, 4, 5, 7, 8, 12, 13, 17, 18, 19, 20, 25, 26, 28, 30, 32, 34, 38, 39, 40, 41, 42, 46, 48, 66 - A2643, A2644, A2645</td>
    </tr>

    <tr  data-spec-optional>
    <td >&nbsp;</td>
    <td >1, 2, 3, 4, 5, 7, 8, 11, 12, 13, 14, 17, 18, 19, 20, 21, 25, 26, 28, 29, 30, 32, 34, 38, 39, 40, 41, 42, 46, 48, 66, 71 - A2484, A2641</td>
    </tr>

    <tr >
    <td ><a href="network-bands.php3">5G bands</a></td>
    <td  data-spec="net5g">1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 30, 38, 40, 41, 48, 66, 77, 78, 79 SA/NSA/Sub6 - A2643, A2644</td>
    </tr>

    <tr  data-spec-optional>
    <td >&nbsp;</td>
    <td >1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 29, 30, 38, 40, 41, 48, 66, 71, 78, 79, 258, 260, 261 SA/NSA/Sub6/mmWave - A2484</td>
    </tr>

    <tr  data-spec-optional>
    <td >&nbsp;</td>
    <td >1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 29, 30, 38, 40, 41, 48, 66, 71, 77, 78, 79 SA/NSA/Sub6 - A2641</td>
    </tr>

    <tr >
    <td ><a href="glossary.php3?term=3g">Speed</a></td>
    <td  data-spec="speed">HSPA 42.2/5.76 Mbps, LTE-A, 5G, EV-DO Rev.A 3.1 Mbps</td>
    </tr>

</table>

Dealing with these rows is my problem: <tr data-spec-optional>. They are not bound by any hierarchy.

These approaches gave me hope, but I could not implement them successfully:

  1. how to select and extract texts between two elements?

  2. Select sequence of next siblings in Scrapy

Any help would be appreciated!

Edit

I need to implement the solution at the lowest level, [LEVEL3]. This is the structure of my spider and its code:

[Page LEVEL1] Brands URL https://www.gsmarena.com/makers.php3
[Page LEVEL2] All Devices URL https://www.gsmarena.com/apple-phones-48.php
[Page LEVEL3] Detail Page URL https://www.gsmarena.com/apple_iphone_13_pro_max-11089.php

import scrapy
from scrapy import Selector
from gsm.items import GsmItem

class GsmSpider(scrapy.Spider):
    name = 'gsm'
    allowed_domains = ['gsmarena.com']
    start_urls = ['https://gsmarena.com/makers.php3']

    # LEVEL1 | all brands

    def parse(self, response):
        
        item = GsmItem()

        gsms = response.xpath('//div[@]/table//tr[3]//td[2]')
        for gsm in gsms:
            allbranddevicesurl = gsm.xpath('.//a/@href').get()
            brandname = gsm.xpath('.//a/text()').get()
            devicecount = gsm.xpath('.//span/text()').get()
            
            item['brandname'] = brandname
            item['devicecount'] = devicecount

            yield response.follow(allbranddevicesurl, callback=self.parse_allbranddevicesurl,
                                    meta= {'brandname': item,
                                           'devicecount': item})

    # LEVEL2 | all devices

    def parse_allbranddevicesurl(self, response):
        
        item = response.meta['brandname']       
        item = response.meta['devicecount'] 

        phones = response.xpath('//*[@id="review-body"]//li')
        for phone in phones:
            detailpageurl = phone.xpath('.//a/@href').get()
            
            item['detailpageurl'] = detailpageurl

            yield response.follow(detailpageurl,
                                    callback=self.parse_detailpage,
                                    meta= {'brandname': item,
                                           'devicecount': item,
                                           'detailpageurl': item,})

        next_page = response.xpath('//a[@]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_allbranddevicesurl,
                                    meta= {'brandname': item,
                                           'devicecount': item,
                                           'detailpageurl': item,})

    # LEVEL3 | Detailpage
    
    def parse_detailpage(self, response):
     
        item = response.meta['brandname']       
        item = response.meta['devicecount']
        item = response.meta['detailpageurl']

 
        for row in response.xpath('//tr[@]'):
            if row.xpath('.//a'):
                ch = 'A'
                title1 = row.xpath('.//a/text()').get()
            else:
                ch = chr(ord(ch) + 1)
            title = title1 + f' ({ch})'
            data = row.xpath('.//td[@]/text()').get()
            
            item['title'] = title
            item['data'] = data

        yield item

CodePudding user response:

Honestly, I'm sure it's not the best way, but it works. If I think of another way, I'll edit the answer.

In [1]: html="""<html>
   ...: <body>
   ...: <table cellspacing="0">
   ...: 
   ...:    <tr >
   ...:     <td ><a href="network-bands.php3">4G bands</a></td>
   ...:     <td  data-spec="net4g">1, 2, 3 - A2643</td>
   ...:     </tr>
   ...: 
   ...:     <tr  data-spec-optional>
   ...:     <td >&nbsp;</td>
   ...:     <td >1, 2, 3 - A2484</td>
   ...:     </tr>
   ...: 
   ...:     <tr >
   ...:     <td ><a href="network-bands.php3">5G bands</a></td>
   ...:     <td  data-spec="net5g">1, 2, 3 - A2643</td>
   ...:     </tr>
   ...: 
   ...:     <tr  data-spec-optional>
   ...:     <td >&nbsp;</td>
   ...:     <td >1, 2, 3 - A2484</td>
   ...:     </tr>
   ...: 
   ...:     <tr  data-spec-optional>
   ...:     <td >&nbsp;</td>
   ...:     <td >1, 2, 3 - A2641</td>
   ...:     </tr>
   ...:     
   ...: </table>
   ...: </body>
   ...: </html>"""

In [2]: from scrapy import Selector

In [3]: response = Selector(text=html)

In [4]: for row in response.xpath('//tr[@]'):
   ...:     if row.xpath('.//a'):
   ...:         ch = 'A'
   ...:         title1 = row.xpath('.//a/text()').get()
   ...:     else:
   ...:         ch = chr(ord(ch) + 1)
   ...:     title = title1 + f' ({ch})'
   ...:     data = row.xpath('.//td[@]/text()').get()
   ...:     print(f'\"{title}\" : \"{data}\"')
   ...:
"4G bands (A)" : "1, 2, 3 - A2643"
"4G bands (B)" : "1, 2, 3 - A2484"
"5G bands (A)" : "1, 2, 3 - A2643"
"5G bands (B)" : "1, 2, 3 - A2484"
"5G bands (C)" : "1, 2, 3 - A2641"

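The loop above relies on `ch` and `title1` carrying over from one iteration to the next, and it fails if the first row has no link. The same carry-forward idea can be sketched as a small function that works on plain (title, value) pairs, so it can be tested without a live page. The name `label_rows` and the input shape are assumptions made for illustration; they are not part of Scrapy.

```python
def label_rows(rows):
    """Yield (label, value) pairs, lettering each row under its last seen title.

    `rows` is an iterable of (title, value) strings; sub-rows carry a blank
    title (for example the '\xa0' produced by &nbsp;).
    """
    current_title = None
    offset = 0
    for title, value in rows:
        title = (title or "").strip()
        if title:
            # A titled row starts a new group at letter A.
            current_title = title
            offset = 0
        elif current_title is None:
            # Sub-row before any titled row: nothing to attach it to.
            continue
        else:
            # Untitled sub-row: reuse the last title with the next letter.
            offset += 1
        yield f"{current_title} ({chr(ord('A') + offset)})", value.strip()


rows = [
    ("4G bands", "1, 2, 3 - A2643"),
    ("\xa0", "1, 2, 3 - A2484"),
    ("5G bands", "1, 2, 3 - A2643"),
    ("\xa0", "1, 2, 3 - A2484"),
    ("\xa0", "1, 2, 3 - A2641"),
]
specs = dict(label_rows(rows))
```

In a Scrapy callback, each pair might be built with something like `row.xpath('string(td[1])').get()` and `row.xpath('string(td[2])').get()`, assuming every row of interest has exactly two `td` cells. Collecting the result into one dict and yielding it once after the loop also avoids overwriting the same item fields on every row.
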
CodePudding user response:

This answer is inspired by @SuperUser's answer, which uses the ordinal value of a letter to increment the count. You can obtain the headers and the values with two XPath queries and then combine them into the final dictionary:

import re

titles = response.xpath("//tr[@class='tr-toggle']/td[@class='ttl']/descendant-or-self::*/text()").getall()

for i, title in enumerate(titles):
    if title.strip() == '':
        prev_letter = re.search(r"\((.)\)$", titles[i-1]).group(1)
        new_title = f"{titles[i-1][:-4]} ({chr(ord(prev_letter) + 1)})"
        titles[i] = new_title
    else:
        titles[i] = f"{title} (A)"

values = response.xpath("//tr[@class='tr-toggle']/td[@class='nfo']/text()").getall()

results = {}
for title, value in zip(titles, values):
    results[title] = value

Printing the results dictionary gives:

{'4G bands (A)': '1, 2, 3 - A2643',
 '4G bands (B)': '1, 2, 3 - A2484',
 '5G bands (A)': '1, 2, 3 - A2643',
 '5G bands (B)': '1, 2, 3 - A2484',
 '5G bands (C)': '1, 2, 3 - A2641'}
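
To experiment with the sample table without installing Scrapy, the (title, value) pairs can also be pulled out with the standard library. This is a minimal sketch that swaps Scrapy selectors for `html.parser.HTMLParser`; the class name `RowCollector` is hypothetical, and the code assumes each row of interest has exactly two `td` cells and that tables are not nested.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect (first_cell, second_cell) text for every <tr> with two <td>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._cells = None  # cells of the row being read, or None outside <tr>
        self._text = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            # strip() also removes the non-breaking space left by &nbsp;
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            if len(self._cells) == 2:
                self.rows.append(tuple(self._cells))
            self._cells = None


sample = """<table cellspacing="0">
    <tr >
    <td ><a href="network-bands.php3">4G bands</a></td>
    <td  data-spec="net4g">1, 2, 3 - A2643</td>
    </tr>
    <tr  data-spec-optional>
    <td >&nbsp;</td>
    <td >1, 2, 3 - A2484</td>
    </tr>
</table>"""

collector = RowCollector()
collector.feed(sample)
collector.close()
pairs = collector.rows
```

Sub-rows show up with an empty first element, which is the same signal both answers use to decide when to move on to the next letter.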