bs4 find sub-text / find_next

On the linked page there is a section called "COSEWIC Assessment report". This section has bold text that heads each category, followed by non-bold text containing the information for that category. I want to scrape the non-bold text using bs4.

In the HTML, the bold text is wrapped in <strong> sample text </strong> tags, so I can find the bold title for each category using result = s.find("strong", text=re.compile("Scientific name")).

What I want next is the information under each of those headers. If I inspect the HTML for that section, it looks like this:

<p>
<strong> Scientific name </strong>

<br>

"Anarta edwarsii"

</p>

So, starting from having located the "Scientific name" element, how do I get the "Anarta edwarsii" part?

I thought bs4's find_next_sibling() or something similar might work, but so far nothing has been successful. It is also important to note that I cannot use the value text to look up the element, because I have to repeat the process for many different species: the header stays constant, but the text under it changes.

Thanks!!

Answer:

You can use next_siblings, which yields every node that follows the <strong> tag inside the <p>, iterate over it with a list comprehension, and join() the results:

' '.join([x.text for x in soup.select_one('p:-soup-contains("Scientific name:") strong').next_siblings]).strip()

Output:

'"Anarta edwarsii"'
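The reason next_siblings helps here is that the value is a bare text node rather than a tag, so a lookup that only considers tags will land on the <br> instead. The sketch below runs offline against the fragment from the question. It assumes the quotation marks shown in the inspector are not part of the page text, and it uses find() with string= and the built-in html.parser instead of the CSS selector above, so it is an illustration of the idea rather than the exact code from this answer.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Fragment reduced from the question (quotes from the inspector dropped: an assumption)
html = """
<p>
<strong> Scientific name </strong>
<br>
Anarta edwarsii
</p>
"""

soup = BeautifulSoup(html, "html.parser")
header = soup.find("strong", string=re.compile("Scientific name"))

# next_siblings yields text nodes (NavigableString) as well as tags such as <br>;
# keep only the text nodes and drop the ones that are pure whitespace
text_nodes = [s.strip() for s in header.next_siblings if isinstance(s, NavigableString)]
value = " ".join(t for t in text_nodes if t)
print(value)
```

Filtering on NavigableString skips the <br> explicitly, which avoids relying on what .text returns for an empty tag.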

Alternative example:

Select the <p> that contains the string "Scientific name:", get its stripped_strings as a list, ['Scientific name:', 'Anarta edwardsii'], and pick the last element (here the second one):

import requests
from bs4 import BeautifulSoup

# Browser-like headers sent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Referer': 'https://www.google.com/'
}

url = 'https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/edwards-beach-moth-2009.html'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# stripped_strings yields the header text first and the value after it; take the last string
list(soup.select_one('p:-soup-contains("Scientific name:")').stripped_strings)[-1]

Output:

'"Anarta edwarsii"'
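Since the question mentions repeating this for many headers and species, the lookup can be wrapped in a small helper. This is only a sketch: field_value is a hypothetical name, the "Status" / "Endangered" paragraph is invented sample data, and the markup is assumed to follow the same <p><strong>header</strong><br>value</p> pattern shown in the question.

```python
import re
from bs4 import BeautifulSoup


def field_value(soup, label):
    """Return the text following the <strong> header that matches `label`, or None."""
    header = soup.find("strong", string=re.compile(re.escape(label)))
    if header is None:
        return None
    parts = []
    for sib in header.next_siblings:
        # Text nodes are str subclasses; tags (e.g. <br>) expose get_text()
        text = sib if isinstance(sib, str) else sib.get_text()
        text = text.strip()
        if text:
            parts.append(text)
    return " ".join(parts) or None


# Hypothetical sample markup; the second paragraph is made up for illustration
sample = """
<p><strong>Scientific name</strong><br>Anarta edwarsii</p>
<p><strong>Status</strong><br>Endangered</p>
"""

soup = BeautifulSoup(sample, "html.parser")
labels = ("Scientific name", "Status", "Missing header")
result = {label: field_value(soup, label) for label in labels}
print(result)
```

Returning None for a header that is absent makes it easier to loop over many species pages without one missing field raising an exception.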