Get all text nodes between iterable nodes

How can I get the text nodes between repeated nodes with Python 3 and the lxml library?
I tried selecting every <b> and collecting the text that follows it on each iteration.

Results I want:

[
    ("A1", "Attr1: A1", "Attr2: B1", "Attr3: C1", "D1"),
    ("A2", "Attr1: A2", "Attr2: B2", "Attr3: C2", "D2"),
    ("A3", "Attr1: A3", "Attr2: B3", "Attr3: C3", "D3"),
]

HTML example:

<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
  <b><a href="">A3</a></b>
  <br/>
  <br/>
  Attr1: A3<br/>
  Attr2: B3<br/>
  Attr3: C3<br/>
  D3<br/>
  <br/><br/><br/>
...
</div>

Code I tried:

from lxml.html import fromstring

with open("filename.html", "r") as f:
    root = fromstring(f.read())
    heads = root.xpath("//b[a[starts-with(., 'A')]]")
    for head in heads:
        for text in head.xpath(
            "./following-sibling::text()[preceding-sibling::b[not(self)]]"
        ):
            print(text)

----
[stdout]

      Attr1: A1

      Attr2: B1

      Attr3: C1

      D1

      

      

      

      

      Attr1: A2

      Attr2: B2

      Attr3: C2

      D2

      

      

      

      

      Attr1: A3

      Attr2: B3

      Attr3: C3

      D3

      

    

      

      

      Attr1: A2

      Attr2: B2

      Attr3: C2

      D2

      

      

      

      

      Attr1: A3

      Attr2: B3

      Attr3: C3

      D3

      

    

      

      

      Attr1: A3

      Attr2: B3

      Attr3: C3

      D3

Edit: I don't think line breaks can be relied on as a delimiter for parsing the real HTML source.

CodePudding user response：

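You can parse the HTML with bs4.BeautifulSoup, call get_text() to get all the text as a single string, and then apply str.split repeatedly to get the desired outcome. Note that this depends on the exact blank-line layout of the source, so indentation or extra whitespace in the real HTML may change the result: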

from bs4 import BeautifulSoup
with open("filename.html", "r") as f:
    html_data = f.read()

soup = BeautifulSoup(html_data, 'lxml')  # name the parser explicitly to avoid the "no parser specified" warning
out = [tuple(s.strip() for s in string.split('\n') if s) 
       for string in soup.get_text().replace('\n\n\n', '\n').split('\n\n') if string]

Output:

[('A1', 'Attr1: A1', 'Attr2: B1', 'Attr3: C1', 'D1'),
 ('A2', 'Attr1: A2', 'Attr2: B2', 'Attr3: C2', 'D2'),
 ('A3', 'Attr1: A3', 'Attr2: B3', 'Attr3: C3', 'D3')]
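
A variant that does not depend on how many newlines separate the records can be sketched as follows. It is only a sketch under assumptions: the function name group_strings is made up for illustration, the stdlib html.parser backend is used, and every string inside a <b> is assumed to start a new record.

```python
from bs4 import BeautifulSoup

HTML = """
<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
  <b><a href="">A3</a></b>
  <br/>
  <br/>
  Attr1: A3<br/>
  Attr2: B3<br/>
  Attr3: C3<br/>
  D3<br/>
  <br/><br/><br/>
</div>
"""


def group_strings(html):
    """Group each non-empty string under the <b> heading that precedes it.

    Hypothetical helper: assumes every string inside a <b> starts a record.
    """
    soup = BeautifulSoup(html, "html.parser")
    groups = []
    # find_all(string=True) yields the text nodes in document order.
    for node in soup.find_all(string=True):
        text = node.strip()
        if not text:
            continue  # whitespace-only node between tags
        if node.find_parent("b") is not None:
            groups.append([text])  # heading text opens a new record
        elif groups:
            groups[-1].append(text)  # other text joins the latest record
    return [tuple(g) for g in groups]


print(group_strings(HTML))
```

Because the grouping is driven by the <b> ancestor rather than by counting newlines, indentation in the source should not matter.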

CodePudding user response：

With BeautifulSoup you can select every <b> that contains an <a> and iterate over its next_siblings until you reach the next <b>. To get rid of the empty strings, use filter():

data = []

for tag in soup.select('b:has(a)'):
    str_list = [tag.text]
    for e in tag.next_siblings:
        if e.name != 'b':
            str_list.append(e.text.strip())
        else:
            break
    data.append(tuple(filter(None, str_list)))

Example

from bs4 import BeautifulSoup

html = '''
<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
  <b><a href="">A3</a></b>
  <br/>
  <br/>
  Attr1: A3<br/>
  Attr2: B3<br/>
  Attr3: C3<br/>
  D3<br/>
  <br/><br/><br/>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

data = []

for tag in soup.select('b:has(a)'):
    str_list = [tag.text]
    for e in tag.next_siblings:
        if e.name != 'b':
            str_list.append(e.text.strip())
        else:
            break
    data.append(tuple(filter(None, str_list)))

data

Output

[('A1', 'Attr1: A1', 'Attr2: B1', 'Attr3: C1', 'D1'),
 ('A2', 'Attr1: A2', 'Attr2: B2', 'Attr3: C2', 'D2'),
 ('A3', 'Attr1: A3', 'Attr2: B3', 'Attr3: C3', 'D3')]
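
Since the question asks for lxml, the same sibling walk can probably be written without BeautifulSoup. The following is a rough sketch, not a tested drop-in: collect_groups is a hypothetical name, and it relies on lxml storing the text that follows an element in that element's tail attribute.

```python
from lxml.html import fromstring

HTML = """
<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
  <b><a href="">A3</a></b>
  <br/>
  <br/>
  Attr1: A3<br/>
  Attr2: B3<br/>
  Attr3: C3<br/>
  D3<br/>
  <br/><br/><br/>
</div>
"""


def collect_groups(html):
    """Walk the siblings after each <b> until the next <b> is reached.

    Hypothetical helper for illustration only.
    """
    root = fromstring(html)
    groups = []
    for head in root.xpath("//b[a]"):
        parts = [head.text_content().strip()]
        # In lxml, text after an element lives in that element's .tail.
        tails = [head.tail]
        for sib in head.itersiblings():  # following siblings, in order
            if sib.tag == "b":
                break  # the next record starts here
            tails.append(sib.tail)
        parts.extend(t.strip() for t in tails if t and t.strip())
        groups.append(tuple(parts))
    return groups


print(collect_groups(HTML))
```

Stopping at the next <b> is what keeps each record separate; the XPath in the question has no such stop condition, which is why later records show up again under earlier headings.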

CodePudding user response：

You can achieve this with the union operator (|) of XPath:

//b/a/text() | //b/following-sibling::text()

This will output:

A1

  

  

  Attr1: A1

  Attr2: B1

  Attr3: C1

  D1

  

  
A2

  

  

  Attr1: A2

  Attr2: B2

  Attr3: C2

  D2

  

  
A3

  

  

  Attr1: A3

  Attr2: B3

  Attr3: C3

  D3

All that remains is to strip the whitespace and regroup the output in your script.
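
One way that cleanup might look with lxml is sketched below. The name group_by_heading is hypothetical, and the sketch assumes that lxml returns the union in document order (as the output above suggests) and that heading strings are exactly those whose parent element is an <a>.

```python
from lxml.html import fromstring

HTML = """
<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
  <b><a href="">A3</a></b>
  <br/>
  <br/>
  Attr1: A3<br/>
  Attr2: B3<br/>
  Attr3: C3<br/>
  D3<br/>
  <br/><br/><br/>
</div>
"""


def group_by_heading(html):
    """Regroup the flat union result into one tuple per heading.

    Hypothetical helper: assumes the node-set arrives in document order.
    """
    root = fromstring(html)
    nodes = root.xpath("//b/a/text() | //b/following-sibling::text()")
    groups = []
    for node in nodes:
        # lxml returns "smart strings" that remember the element they
        # belong to; read that before strip() turns them into plain str.
        is_heading = node.getparent().tag == "a"
        text = node.strip()
        if not text:
            continue  # whitespace-only text between <br/> tags
        if is_heading:
            groups.append([text])  # text of <a> opens a new record
        elif groups:
            groups[-1].append(text)  # tail text joins the latest record
    return [tuple(g) for g in groups]


print(group_by_heading(HTML))
```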
