How can I get the text nodes between repeated nodes with Python 3 and the lxml library?
I tried selecting every <b> and collecting the text that follows it on each iteration.
The result I want:
[
("A1", "Attr1: A1", "Attr2: B1", "Attr3: C1", "D1"),
("A2", "Attr1: A2", "Attr2: B2", "Attr3: C2", "D2"),
("A3", "Attr1: A3", "Attr2: B3", "Attr3: C3", "D3"),
]
HTML example:
<div>
<b><a href="">A1</a></b>
<br/>
<br/>
Attr1: A1<br/>
Attr2: B1<br/>
Attr3: C1<br/>
D1<br/>
<br/><br/><br/>
<b><a href="">A2</a></b>
<br/>
<br/>
Attr1: A2<br/>
Attr2: B2<br/>
Attr3: C2<br/>
D2<br/>
<br/><br/><br/>
<b><a href="">A3</a></b>
<br/>
<br/>
Attr1: A3<br/>
Attr2: B3<br/>
Attr3: C3<br/>
D3<br/>
<br/><br/><br/>
...
</div>
Code I tried:
from lxml.html import fromstring
with open("filename.html", "r") as f:
    root = fromstring(f.read())
heads = root.xpath("//b[a[starts-with(., 'A')]]")
for head in heads:
    for text in head.xpath(
        "./following-sibling::text()[preceding-sibling::b[not(self)]]"
    ):
        print(text)
----
[stdout]
Attr1: A1
Attr2: B1
Attr3: C1
D1
Attr1: A2
Attr2: B2
Attr3: C2
D2
Attr1: A3
Attr2: B3
Attr3: C3
D3
Attr1: A2
Attr2: B2
Attr3: C2
D2
Attr1: A3
Attr2: B3
Attr3: C3
D3
Attr1: A3
Attr2: B3
Attr3: C3
D3
Edit: I don't think line breaks can be relied on as a delimiter when parsing the real HTML source.
CodePudding user response:
You can parse the HTML with bs4.BeautifulSoup, call get_text to get all the text as a single string, and then apply str.split repeatedly to get the desired result:
from bs4 import BeautifulSoup
with open("filename.html", "r") as f:
    html_data = f.read()
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(html_data, "html.parser")
out = [tuple(s.strip() for s in string.split('\n') if s)
       for string in soup.get_text().replace('\n\n\n', '\n').split('\n\n') if string]
Output:
[('A1', 'Attr1: A1', 'Attr2: B1', 'Attr3: C1', 'D1'),
('A2', 'Attr1: A2', 'Attr2: B2', 'Attr3: C2', 'D2'),
('A3', 'Attr1: A3', 'Attr2: B3', 'Attr3: C3', 'D3')]
CodePudding user response:
Using BeautifulSoup, you can select every <b> that contains an <a> and iterate over its next_siblings until the next <b> is reached. To get rid of the empty strings, use filter():
data = []
for tag in soup.select('b:has(a)'):
    str_list = [tag.text]
    for e in tag.next_siblings:
        if e.name != 'b':
            str_list.append(e.text.strip())
        else:
            break
    data.append(tuple(filter(None, str_list)))
Example
from bs4 import BeautifulSoup
html = '''
<div>
<b><a href="">A1</a></b>
<br/>
<br/>
Attr1: A1<br/>
Attr2: B1<br/>
Attr3: C1<br/>
D1<br/>
<br/><br/><br/>
<b><a href="">A2</a></b>
<br/>
<br/>
Attr1: A2<br/>
Attr2: B2<br/>
Attr3: C2<br/>
D2<br/>
<br/><br/><br/>
<b><a href="">A3</a></b>
<br/>
<br/>
Attr1: A3<br/>
Attr2: B3<br/>
Attr3: C3<br/>
D3<br/>
<br/><br/><br/>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
data = []
for tag in soup.select('b:has(a)'):
    str_list = [tag.text]
    for e in tag.next_siblings:
        if e.name != 'b':
            str_list.append(e.text.strip())
        else:
            break
    data.append(tuple(filter(None, str_list)))
data
Output
[('A1', 'Attr1: A1', 'Attr2: B1', 'Attr3: C1', 'D1'),
('A2', 'Attr1: A2', 'Attr2: B2', 'Attr3: C2', 'D2'),
('A3', 'Attr1: A3', 'Attr2: B3', 'Attr3: C3', 'D3')]
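Since the question asked about lxml, the same sibling walk can probably be written without BeautifulSoup. The following is only a sketch under assumptions: the sample markup is a shortened version of the one in the question, and it relies on lxml storing the text that follows an element in that element's .tail attribute.

```python
from lxml.html import fromstring

# Shortened sample markup (an assumption; the real page may differ).
html = """<div>
<b><a href="">A1</a></b>
<br/>
<br/>
Attr1: A1<br/>
Attr2: B1<br/>
D1<br/>
<br/><br/><br/>
<b><a href="">A2</a></b>
<br/>
<br/>
Attr1: A2<br/>
Attr2: B2<br/>
D2<br/>
<br/><br/><br/>
</div>"""

root = fromstring(html)

data = []
for head in root.xpath("//b[a]"):
    # Start with the heading text, then the text right after </b>.
    parts = [head.text_content().strip(), (head.tail or "").strip()]
    for sib in head.itersiblings():
        if sib.tag == "b":
            # The next heading starts a new record.
            break
        # Text after each <br/> is stored in that element's tail.
        parts.append((sib.tail or "").strip())
    # Drop the empty strings left by whitespace-only tails.
    data.append(tuple(p for p in parts if p))

print(data)
```

Stopping at the next <b> is what keeps each record separate, which the XPath in the question did not do.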
CodePudding user response:
You can achieve this with the union operator (|) of XPath:
//b/a/text() | //b/following-sibling::text()
This will output:
A1
Attr1: A1
Attr2: B1
Attr3: C1
D1
A2
Attr1: A2
Attr2: B2
Attr3: C2
D2
A3
Attr1: A3
Attr2: B3
Attr3: C3
D3
All you have to do is clean up the whitespace and reformat the output in your script.
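One possible way to do that cleanup in lxml is sketched below. It assumes a shortened version of the sample markup, that the union comes back in document order, and that lxml's "smart string" results (with is_text and getparent()) are enabled, which is the default. A string that is the text of an <a> starts a new tuple; everything else is appended to the current one.

```python
from lxml.html import fromstring

# Shortened sample markup (an assumption; the real page may differ).
html = """<div>
<b><a href="">A1</a></b>
<br/>
<br/>
Attr1: A1<br/>
Attr2: B1<br/>
D1<br/>
<br/><br/><br/>
<b><a href="">A2</a></b>
<br/>
<br/>
Attr1: A2<br/>
Attr2: B2<br/>
D2<br/>
<br/><br/><br/>
</div>"""

root = fromstring(html)
nodes = root.xpath("//b/a/text() | //b/following-sibling::text()")

groups = []
for node in nodes:
    value = node.strip()
    if not value:
        # Skip whitespace-only text nodes between the <br/> tags.
        continue
    if node.is_text and node.getparent().tag == "a":
        # Text inside <a> is a heading, so it opens a new group.
        groups.append([value])
    elif groups:
        groups[-1].append(value)

result = [tuple(g) for g in groups]
print(result)
```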
