I need to iterate over invalid HTML and read the text of every tag so that I can change it.
from bs4 import BeautifulSoup
html_doc = """
<div data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div href="#"></div>
<div >
<h3 id="headline-213-142"><span id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div id="text_block-214-142"><span id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")
print(soup)
The result is:
<div data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div href="#"></div>
<div >
<h3 id="headline-213-142"><span id="span-225-142">1</span></h3> </div>
</div><div id="text_block-214-142"><span id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>
I know how to replace the text, but BeautifulSoup does not find the text that sits directly inside the paragraph tag. The strings "Začátek sklizně:", "Otevřeno:" and ", denně" are never found, so I cannot replace them.
I tried different parsers (lxml and html5lib), but that made no difference. I also tried Python's HTML library, but it only supports iterating over HTML, not changing it.
CodePudding user response:
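On a Tag object, .string returns a NavigableString only if the tag has exactly one child and that child is a string (or a single child tag that itself has a .string). If the tag has no children or more than one child, .string returns None. That is why the text directly inside the <p> tag is skipped: the <p> has several children, so its .string is None.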
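To make the difference visible, here is a minimal sketch on a shortened, made-up snippet (not the HTML from the question). It assumes bs4 is installed and only uses .string and .strings:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup with the same shape as the <p> in the question
soup = BeautifulSoup("<p>Start: <strong>Open</strong>, daily</p>", "html.parser")

# <strong> has exactly one child and it is a string, so .string returns it
print(soup.strong.string)     # Open

# <p> has three children (text, <strong>, text), so .string is None
print(soup.p.string)          # None

# .strings yields every text node below the tag, regardless of nesting
print(list(soup.p.strings))   # ['Start: ', 'Open', ', daily']
```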
The scenario is not quite clear to me, but here is one last approach based on your comment:
"I need generic code to iterate any html and find all texts so I can work with them."
# text=True matches every text node; newer bs4 versions prefer string=True
for tag in soup.find_all(text=True):
    tag.replace_with('1')
Example
from bs4 import BeautifulSoup
html_doc = """<div data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div href="#"></div>
<div >
<h3 id="headline-213-142"><span id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div id="text_block-214-142"><span id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Replace every text node, including whitespace-only ones between tags
for tag in soup.find_all(text=True):
    tag.replace_with('1')
print(soup)
Output
<div data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div href="#"></div>1<div >1<h3 id="headline-213-142"><span id="span-225-142">1</span></h3>1</div>1</div><div id="text_block-214-142"><span id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>
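As the output shows, the whitespace between tags is also a text node and gets replaced. If that is not wanted, one possible variation is to skip nodes that contain only whitespace. The sketch below assumes bs4 4.4.0 or newer (where the argument is spelled string=True; older versions use text=True), runs on a shortened, made-up snippet, and uses a hypothetical transform() helper that stands in for whatever change you actually need:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup; the whitespace between tags is kept on purpose
html_doc = "<div>\n<p>Start: <strong>Open</strong>, daily</p>\n</div>"
soup = BeautifulSoup(html_doc, "html.parser")


def transform(text):
    # Hypothetical placeholder: replace with your own text processing
    return text.upper()


# find_all returns a list, so replacing nodes while looping over it is safe
for node in soup.find_all(string=True):
    if node.strip():  # skip text nodes that are only whitespace
        node.replace_with(transform(node))

print(soup)
```

Note that string=True also matches other string-like nodes such as comments, so a document that contains those may need an extra type check.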
