Extract Parent Text without Children Text; Parsing HTML

I have this small fragment of HTML, held as a BeautifulSoup tag element, that I pulled using Selenium and BeautifulSoup.

<footer>
    <p >Environment:
      <span >Desert</span>
    </p>
    <p >Basic Rules
      <span >, pg. 334</span>
    </p>
</footer>

I am trying to grab the text from just the <p> elements, but every attempt also returns the text of the nested <span>. This is what I have tried so far:

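# Environment is assumed to be the Tag that holds the markup above
for p in Environment.find_all('p'):
    print(p.text)  # .text also includes the text of nested tags such as <span>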

I have also tried removing the <span> elements with .extract(), but that doesn't seem to work for me.

CodePudding user response:

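You can use .contents, which lists a tag's direct children, and take the element at index 0. This works here because the text comes before the <span> in each <p>: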

for tag in soup.find_all("p"):
    print(tag.contents[0].strip())

Output:

Environment:
Basic Rules
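
Indexing .contents[0] relies on the parent's text being the first child. If that is not guaranteed, one alternative is to collect only the text nodes that are direct children of each <p>. The following is a minimal sketch, not part of the original answer; the helper name own_text and the html variable are made up for illustration, and the markup is copied from the question:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<footer>
    <p >Environment:
      <span >Desert</span>
    </p>
    <p >Basic Rules
      <span >, pg. 334</span>
    </p>
</footer>
"""

soup = BeautifulSoup(html, "html.parser")


def own_text(tag):
    """Return the text held directly by `tag`, skipping text inside child tags.

    `own_text` is a hypothetical helper name, not a BeautifulSoup API.
    """
    # string=True keeps only text nodes; recursive=False restricts the
    # search to direct children, so text inside <span> is never visited.
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(s.strip() for s in pieces if s.strip())


parent_texts = [own_text(p) for p in soup.find_all("p")]
print(parent_texts)
```

Unlike .contents[0], this does not depend on where the <span> sits inside the <p>, and it leaves the parsed tree unchanged.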

Or, following your attempt, you can remove the <span> elements with .extract(). Note that this modifies the soup in place:

for tag in soup.select("p span"):
    tag.extract()

print(soup.prettify())

Output:

<footer>
 <p >
  Environment:
 </p>
 <p >
  Basic Rules
 </p>
</footer>
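
Once the <span> elements are gone, reading the text of each <p> returns only the parent's own text. Since .extract() returns the tag it removes, the child text can be kept as well if it is still needed. A small self-contained sketch under those assumptions (the variable names are illustrative, and the markup is a condensed version of the one in the question):

```python
from bs4 import BeautifulSoup

# Condensed version of the markup from the question.
html = (
    "<footer>"
    "<p>Environment: <span>Desert</span></p>"
    "<p>Basic Rules<span>, pg. 334</span></p>"
    "</footer>"
)
soup = BeautifulSoup(html, "html.parser")

# extract() detaches each <span> from the tree and returns it,
# so the removed tags can be kept in a list for later use.
removed = [span.extract() for span in soup.select("p span")]

# With the spans detached, each <p> contains only its own text.
remaining = [p.get_text(strip=True) for p in soup.find_all("p")]
removed_texts = [span.get_text(strip=True) for span in removed]

print(remaining)
print(removed_texts)
```

If the removed tags are not needed afterwards, .decompose() can be used in place of .extract(); it removes the tag and destroys it rather than returning it.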