Extracting string from <h1> element with logic attached

Time:01-11

I am trying to scrape some sports game data and have run into some issues with my code. Eventually I will move this data into a dataframe and then into a database.

In the code, I have found the class element of one of the headers I would like to parse. There are multiple h1's in the HTML I am parsing.

 <div class="type-game">
      <div>NHL Regular Season</div>
      <h1>Blackhawks vs. Ducks</h1>
 </div>

With this HTML structure, how can I get the text of the h1 as a string that I can use to populate a dataframe?

Code I have tried so far is:

 import requests
 from bs4 import BeautifulSoup as bs

 req = requests.get(url)  # + str(page) + '/')
 soup = bs(req.text, 'html.parser')

 stype = soup.find('h1', class_='type-game')
 print(stype)

This code returns "None". I have checked other articles on here and nothing has worked so far.

As a follow-up, is there a way to write a for loop (or similar) that goes through all of the pages (the website numbers its events sequentially) and keeps only the games whose header contains a given string?

For example, what if I only wanted to save games that have "Blackhawks" in the h1 inside the div element with class="type-game"?

Pseudocode would be something like this:

 For webpages 1 to 10000:
      if class_='type-game' 'h1' contains "Blackhawks"
           then proceed with parsing the code
      if not, skip the code and go to the next webpage

I know this is a little open ended, but I have a good VBA background and trying to apply those coding ideas to Python has been a challenge.

CodePudding user response:

Select your elements more specifically, for example with CSS selectors:

soup.select('h1:-soup-contains("Blackhawks")')

or

soup.select('div.type-game h1:-soup-contains("Blackhawks")')

To get the text from a tag, just use .text or get_text():

for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
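
As a minimal sketch of why the find() call in the question prints None: assuming, as the question describes, that class="type-game" sits on the surrounding div and not on the h1, then find('h1', class_='type-game') looks for a class on the h1 itself and finds nothing. A descendant selector targets the h1 inside that div instead. The markup below is a reconstruction of the question's snippet under that assumption.

```python
from bs4 import BeautifulSoup

# Assumed markup: the class is on the outer div, the h1 has no class of its own.
html = """
<div class="type-game">
  <div>NHL Regular Season</div>
  <h1>Blackhawks vs. Ducks</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ filters on the h1 tag itself, so nothing matches here.
missing = soup.find("h1", class_="type-game")

# "div.type-game h1" means: any h1 nested inside a div with class type-game.
h1 = soup.select_one("div.type-game h1")
title = h1.get_text(strip=True) if h1 is not None else None

print(missing)
print(title)
```

select_one() returns None when nothing matches, so checking for None before calling get_text() avoids an AttributeError on pages that lack the element.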

Example

from bs4 import BeautifulSoup

html='''
<div class="type-game">
      <div>NHL Regular Season</div>
      <h1>Blackhawks vs. Ducks</h1>
</div>
<div class="type-game">
      <div>NHL Regular Season</div>
      <h1>Hawks vs. Ducks</h1>
</div>
<div class="type-game">
      <div>NHL Regular Season</div>
      <h1>Ducks vs. Blackhawks</h1>
</div>
'''

soup = BeautifulSoup(html,'lxml')

for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)

Output

Blackhawks vs. Ducks
Ducks vs. Blackhawks
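
Since the goal is to populate a dataframe, one possible way to collect the matches is as a list of dictionaries, one per game. This is only a sketch: the column names title, team_1 and team_2 are made up for illustration, the class="type-game" attribute on the outer div is assumed from the question, and splitting on " vs. " assumes every header follows that pattern.

```python
from bs4 import BeautifulSoup

html = """
<div class="type-game"><div>NHL Regular Season</div><h1>Blackhawks vs. Ducks</h1></div>
<div class="type-game"><div>NHL Regular Season</div><h1>Hawks vs. Ducks</h1></div>
<div class="type-game"><div>NHL Regular Season</div><h1>Ducks vs. Blackhawks</h1></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for h1 in soup.select('div.type-game h1:-soup-contains("Blackhawks")'):
    title = h1.get_text(strip=True)
    # partition() splits on the first " vs. "; team_2 is "" if it is absent.
    team_1, _, team_2 = title.partition(" vs. ")
    rows.append({"title": title, "team_1": team_1, "team_2": team_2})

for row in rows:
    print(row)
```

If pandas is installed, passing this list to pandas.DataFrame(rows) should give one row per matching game with the dictionary keys as columns.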

EDIT

for e in soup.select('div.type-game h1'):
    if 'Blackhawks' in e.text:
        print(e.text)  # or do whatever else is needed
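
For the second part of the question (looping over sequentially numbered pages), a rough sketch could look like the following. The helper names game_title, fetch_html and scan_pages are hypothetical, and the URL layout base_url + page number + "/" is an assumption based on the commented-out part of the question's code; adjust it to the real site. The fetch parameter exists only so the loop can be tried without hitting the network.

```python
import requests
from bs4 import BeautifulSoup


def game_title(html, keyword):
    """Return the h1 text inside div.type-game if it contains keyword, else None."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.select_one("div.type-game h1")
    if h1 is None:
        return None
    title = h1.get_text(strip=True)
    return title if keyword in title else None


def fetch_html(url):
    """Download one page; return None if the request fails or is not a 2xx/3xx."""
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    return resp.text if resp.ok else None


def scan_pages(base_url, page_numbers, keyword, fetch=fetch_html):
    """Collect (page, title) pairs for pages whose game header contains keyword."""
    matches = []
    for page in page_numbers:
        html = fetch(f"{base_url}{page}/")  # assumed URL pattern
        if html is None:
            continue  # page missing or request failed: skip it
        title = game_title(html, keyword)
        if title is not None:
            matches.append((page, title))  # "proceed with parsing" goes here
    return matches
```

A call such as scan_pages(base_url, range(1, 10001), "Blackhawks") would mirror the pseudocode in the question. When scanning that many pages, adding a short pause between requests (for example with time.sleep) is a reasonable courtesy to the site.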