I need to get the text from the first <td> element of each <tr>. I don't want all of the text, though: only the text inside <a> tags and the text that sits directly in the <td>, outside of any other tag. In the example below, the text I need is marked as "yyy"/"y" and the text I don't need is marked as "zzz".
<table>
<tbody>
<tr>
<td>
<b>zzz</b>
<a href="#">yyy</a>
"y"
<a href="#">yyy</a>
<sup>zzz</sup>
<a href="#">yyy</a>
<a href="#">yyy</a>
"y"
</td>
<td>
zzzzz
</td>
</tr>
</tbody>
</table>
Here is what I have at the moment:
words = []
for tableRows in soup.select("table > tbody > tr"):
    tableData = tableRows.find("td").text
    text = [word.strip() for word in tableData.split(' ')]
    words.append(text)
print(words)
But this code parses all the text from the <td>, not just the parts I need: ["zzz", "yyyy", "yyyy", "zzz", "yyyy"].
CodePudding user response:
Try:
from bs4 import BeautifulSoup, Tag, NavigableString
html_doc = """\
<table>
<tbody>
<tr>
<td>
<b>zzz</b>
<a href="#">yyy</a>
"y"
<a href="#">yyy</a>
<sup>zzz</sup>
<a href="#">yyy</a>
<a href="#">yyy</a>
"y"
</td>
<td>
zzzzz
</td>
</tr>
</tbody>
</table>"""
soup = BeautifulSoup(html_doc, "html.parser")
for td in soup.select("td:nth-of-type(1)"):
    for c in td.contents:
        if isinstance(c, Tag) and c.name == "a":
            print(c.text.strip())
        elif isinstance(c, NavigableString):
            c = c.strip()
            if c:
                print(c)
Prints:
yyy
"y"
yyy
yyy
yyy
"y"
- soup.select("td:nth-of-type(1)") selects just the first <td> of each row.
- Then we iterate over the .contents of this <td>.
- if isinstance(c, Tag) and c.name == "a" checks whether the item is a Tag and the name of the Tag is a.
- if isinstance(c, NavigableString) checks whether the item is a plain string.
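The same checks can also be wrapped in a small helper that returns the wanted strings grouped per row instead of printing them. This is only a sketch: the function name first_cell_texts and the shortened sample HTML are made up for illustration, and it assumes the html.parser backend used above.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def first_cell_texts(html):
    """Hypothetical helper (not part of BeautifulSoup).

    For each first <td> of a row, keep the text of direct <a> children
    and of bare text nodes, and skip every other tag.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for td in soup.select("tr > td:nth-of-type(1)"):
        texts = []
        for c in td.contents:
            if isinstance(c, Tag) and c.name == "a":
                text = c.get_text(strip=True)
            elif isinstance(c, NavigableString):
                text = c.strip()
            else:
                continue  # <b>, <sup>, ... are ignored
            if text:
                texts.append(text)
        rows.append(texts)
    return rows


sample = """
<table><tbody><tr>
<td><b>zzz</b> <a href="#">yyy</a> "y" <sup>zzz</sup> <a href="#">yyy</a></td>
<td>zzzzz</td>
</tr></tbody></table>
"""
print(first_cell_texts(sample))
```

Returning one list per row keeps the results of different rows apart, which may be handy if the table has more than one <tr>.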
CodePudding user response:
Based on your example, iterate over the children of the td tag.
Then check whether the child's name is a or None (None means the child is a plain text node).
Then, if the child has non-empty text, append it.
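words = []
for item in soup.select("table > tbody > tr"):
    # item.td is the first <td> of the row
    for child in item.td.children:
        # keep <a> tags and plain text nodes (whose name is None)
        if child.name == 'a' or child.name is None:
            if child.text.strip():
                words.append(child.text.strip())
print(words)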
Output:
['yyy', '"y"', 'yyy', 'yyy', 'yyy', '"y"']
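One caveat with item.td: if a row has no <td> at all (for example a header row made only of <th> cells), item.td is None and accessing .children raises AttributeError. Below is a minimal sketch of a guard, assuming such rows should simply be skipped. The name collect_words and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def collect_words(soup):
    """Hypothetical helper: gather <a> text and bare text from each row's first <td>."""
    words = []
    for row in soup.select("table > tbody > tr"):
        td = row.find("td")
        if td is None:  # e.g. a header row containing only <th> cells
            continue
        for child in td.children:
            if isinstance(child, Tag) and child.name == "a":
                text = child.get_text(strip=True)
            elif isinstance(child, NavigableString):
                text = str(child).strip()
            else:
                continue  # any other tag is ignored
            if text:
                words.append(text)
    return words


html_doc = """
<table><tbody>
<tr><th>header</th></tr>
<tr><td><b>zzz</b><a href="#">yyy</a> "y" </td><td>zzzzz</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(collect_words(soup))
```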
