How can I scrape this HTML table correctly with Scrapy?

4G bands (A)	4G bands (B)	5G bands (A)	5G bands (B)	5G bands (C)
1, 2, 3 - A2643	1, 2, 3 - A2484	1, 2, 3 - A2643	1, 2, 3 - A2484	1, 2, 3 - A2641

How do I get the above-mentioned output from the following HTML table?

<table cellspacing="0">

    <tr class="tr-toggle">
    <td class="ttl"><a href="network-bands.php3">4G bands</a></td>
    <td class="nfo" data-spec="net4g">1, 2, 3 - A2643</td>
    </tr>

    <tr class="tr-toggle" data-spec-optional>
    <td class="ttl">&nbsp;</td>
    <td class="nfo">1, 2, 3 - A2484</td>
    </tr>

    <tr class="tr-toggle">
    <td class="ttl"><a href="network-bands.php3">5G bands</a></td>
    <td class="nfo" data-spec="net5g">1, 2, 3 - A2643</td>
    </tr>

    <tr class="tr-toggle" data-spec-optional>
    <td class="ttl">&nbsp;</td>
    <td class="nfo">1, 2, 3 - A2484</td>
    </tr>

    <tr class="tr-toggle" data-spec-optional>
    <td class="ttl">&nbsp;</td>
    <td class="nfo">1, 2, 3 - A2641</td>
    </tr>

</table>

My problem is these rows: <tr data-spec-optional>. They are not bound by any hierarchy: each one is a sibling of the labelled row it belongs to, not a child of it.

I have developed a spider that successfully crawls the entire structure of an online portal and collects the necessary information. Only these optional rows cause me problems.

This approach gave me hope, but I could not implement it successfully: "How to select and extract texts between two elements?"

Any help would be appreciated!

CodePudding user response：

Honestly, I'm sure this isn't the best way, but it works. If I think of another way, I'll edit the answer.

In [1]: html="""<html>
   ...: <body>
   ...: <table cellspacing="0">
   ...:
   ...:     <tr class="tr-toggle">
   ...:     <td class="ttl"><a href="network-bands.php3">4G bands</a></td>
   ...:     <td class="nfo" data-spec="net4g">1, 2, 3 - A2643</td>
   ...:     </tr>
   ...:
   ...:     <tr class="tr-toggle" data-spec-optional>
   ...:     <td class="ttl">&nbsp;</td>
   ...:     <td class="nfo">1, 2, 3 - A2484</td>
   ...:     </tr>
   ...:
   ...:     <tr class="tr-toggle">
   ...:     <td class="ttl"><a href="network-bands.php3">5G bands</a></td>
   ...:     <td class="nfo" data-spec="net5g">1, 2, 3 - A2643</td>
   ...:     </tr>
   ...:
   ...:     <tr class="tr-toggle" data-spec-optional>
   ...:     <td class="ttl">&nbsp;</td>
   ...:     <td class="nfo">1, 2, 3 - A2484</td>
   ...:     </tr>
   ...:
   ...:     <tr class="tr-toggle" data-spec-optional>
   ...:     <td class="ttl">&nbsp;</td>
   ...:     <td class="nfo">1, 2, 3 - A2641</td>
   ...:     </tr>
   ...:
   ...: </table>
   ...: </body>
   ...: </html>"""

In [2]: from scrapy import Selector

In [3]: response = Selector(text=html)

In [4]: for row in response.xpath('//tr[@class="tr-toggle"]'):
   ...:     if row.xpath('.//a'):
   ...:         # A row with a link starts a new group: reset the letter to A.
   ...:         ch = 'A'
   ...:         title1 = row.xpath('.//a/text()').get()
   ...:     else:
   ...:         # An optional row reuses the last title and takes the next letter.
   ...:         ch = chr(ord(ch) + 1)
   ...:     title = title1 + f' ({ch})'
   ...:     data = row.xpath('.//td[@class="nfo"]/text()').get()
   ...:     print(f'"{title}" : "{data}"')
   ...:
"4G bands (A)" : "1, 2, 3 - A2643"
"4G bands (B)" : "1, 2, 3 - A2484"
"5G bands (A)" : "1, 2, 3 - A2643"
"5G bands (B)" : "1, 2, 3 - A2484"
"5G bands (C)" : "1, 2, 3 - A2641"

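A possible refactoring, not part of the original answer: the lettering logic can be pulled out of the XPath loop so it can be checked on its own. This is only a sketch. It assumes the rows have already been reduced to (label, value) pairs, with the label set to None for the optional rows, and `label_rows` is a made-up name.

```python
from string import ascii_uppercase


def label_rows(pairs):
    """Turn (label, value) pairs into {"<label> (<letter>)": value}.

    A pair whose label is None continues the most recent labelled row and
    takes the next letter. Hypothetical helper, not part of Scrapy.
    """
    results = {}
    current = None  # label of the group currently being filled
    index = 0       # position inside the group: 0 -> A, 1 -> B, ...
    for label, value in pairs:
        if label is not None:
            current, index = label, 0
        elif current is None:
            raise ValueError("optional row appears before any labelled row")
        else:
            index += 1
        # ascii_uppercase[index] raises IndexError past "Z" instead of
        # silently producing "[" the way chr(ord("Z") + 1) would.
        results[f"{current} ({ascii_uppercase[index]})"] = value
    return results


pairs = [
    ("4G bands", "1, 2, 3 - A2643"),
    (None, "1, 2, 3 - A2484"),
    ("5G bands", "1, 2, 3 - A2643"),
    (None, "1, 2, 3 - A2484"),
    (None, "1, 2, 3 - A2641"),
]
results = label_rows(pairs)
print(results)
```

In a spider the pairs could come from the loop above: row.xpath('.//a/text()').get() returns None for rows without a link, which is the shape this function expects.
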
CodePudding user response：

This answer is inspired by @SuperUser's answer, which uses the ordinal value to increment the letter. You can obtain the headers and values using XPath and then combine them to form the final dictionary:

import re

titles = response.xpath("//tr[@class='tr-toggle']/td[@class='ttl']/descendant-or-self::*/text()").getall()

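for i, title in enumerate(titles):
    if title.strip() == '':
        # Blank title (the &nbsp; cell): reuse the previous title and bump its letter.
        prev_letter = re.search(r"\((.)\)$", titles[i-1]).group(1)
        # [:-4] drops the trailing " (X)" suffix from the previous title.
        new_title = f"{titles[i-1][:-4]} ({chr(ord(prev_letter) + 1)})"
        titles[i] = new_title
    else:
        # A real title starts a new group at letter A.
        titles[i] = f"{title} (A)"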

values = response.xpath("//tr[@class='tr-toggle']/td[@class='nfo']/text()").getall()

# Pair each title with the value at the same position.
results = dict(zip(titles, values))

Printing the results dictionary gives:

{'4G bands (A)': '1, 2, 3 - A2643',
 '4G bands (B)': '1, 2, 3 - A2484',
 '5G bands (A)': '1, 2, 3 - A2643',
 '5G bands (B)': '1, 2, 3 - A2484',
 '5G bands (C)': '1, 2, 3 - A2641'}