On the linked page, there is a section called "COSEWIC Assessment report". This section has bold text that heads each category, followed by non-bold text containing the information for that category. I want to scrape the non-bold text using bs4.
The bold text is wrapped in <strong> sample text </strong> tags, so I can find the bold title of each category using result = s.find("strong", text=re.compile("Scientific name")).
That said, I would then like to scrape the information under each of those headers. If I inspect the HTML for that section, it looks like this:
<p>
<strong> Scientific name </strong>
<br>
"Anarta edwarsii"
</p>
So, having located the "Scientific name" part, how do I get the "Anarta edwarsii" part?
I thought bs4's find_next_sibling() or something similar might work, but so far nothing has been successful. It is also important to note that I cannot use the text itself to look up the element, because I have to repeat the process for many different species: the header remains constant, but the text under it changes.
Thanks!!
CodePudding user response:
You can use next_siblings to get the nodes that follow the <strong> element, iterate over them with a list comprehension, and join() the results:
' '.join([x.text for x in soup.select_one('p:-soup-contains("Scientific name:") strong').next_siblings]).strip()
Output:
'"Anarta edwarsii"'
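To try the sibling approach without hitting the live page, here is a minimal offline sketch. It assumes the markup looks exactly like the snippet in the question (including the quotation marks around the name), uses the built-in html.parser instead of lxml, and wraps the lookup in a helper called text_after_header, which is a made-up name for illustration only:

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Inline copy of the markup from the question (assumed structure, not fetched from the site)
html = """
<p>
<strong> Scientific name </strong>
<br>
"Anarta edwarsii"
</p>
"""

def text_after_header(soup, header):
    """Hypothetical helper: return the text following the <strong> whose text matches `header`."""
    strong = soup.find("strong", string=re.compile(header))
    if strong is None:
        return None
    # Keep only the text nodes among the following siblings; this skips the <br> tag
    parts = [sib.strip() for sib in strong.next_siblings if isinstance(sib, NavigableString)]
    # Drop whitespace-only nodes and join whatever text remains
    return " ".join(p for p in parts if p)

soup = BeautifulSoup(html, "html.parser")
value = text_after_header(soup, "Scientific name")
print(value)
```

Because the lookup depends only on the header, the same call should work for other species pages as long as they follow the same <p><strong>…</strong><br>text</p> layout.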
Alternative example:
Select the <p> that contains the string "Scientific name:", get its stripped_strings as a list, ['Scientific name:', 'Anarta edwardsii'], and pick the last element (here, the second one):
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Referer': 'https://www.google.com/'
}
r = requests.get('https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/edwards-beach-moth-2009.html', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
list(soup.select_one('p:-soup-contains("Scientific name:")').stripped_strings)[-1]
Output:
'"Anarta edwarsii"'
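Since the same lookup has to be repeated for many species, it may be convenient to collect every header/value pair of a page in one pass. The following is only a sketch under assumptions: the sample markup is hand-written, the "Status" paragraph is made up for illustration, header_map is a hypothetical helper name, and each category is assumed to live in its own <p> with a direct <strong> child:

```python
from bs4 import BeautifulSoup

# Hand-written sample; the second paragraph is invented for illustration
sample = """
<p><strong>Scientific name:</strong><br>Anarta edwardsii</p>
<p><strong>Status:</strong><br>Endangered</p>
<p>A paragraph without a bold header is ignored.</p>
"""

def header_map(soup):
    """Hypothetical helper: map each bold header to the text that follows it in the same <p>."""
    result = {}
    for p in soup.find_all("p"):
        strong = p.find("strong", recursive=False)
        if strong is None:
            continue  # not a header/value paragraph
        label = strong.get_text(strip=True).rstrip(":")
        strings = list(p.stripped_strings)
        # The first string is the header itself; the rest is the value
        result[label] = " ".join(strings[1:])
    return result

fields = header_map(BeautifulSoup(sample, "html.parser"))
print(fields["Scientific name"])
```

With a dictionary like this, the header stays the constant key and the value changes per species page.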
