I am trying to scrape some sports game data and I have ran into some issues with my code. Eventually I will move this data into a dataframe and then eventually a database.
I am trying to scrape some sports data.
In the code, I have found the class element of one of the headers I would like to parse. There are multiple h1's in the HTML I am parsing.
<div >
<div >NHL Regular Season</div>
<h1>Blackhawks vs. Ducks</h1>
</div>
With this HTML structure, how can I get the h1 to return to a string I can use to populate a dataframe?
Code I have tried so far is:
req = requests.get(url) # str(page) '/')
soup = bs(req.text, 'html.parser')
stype = soup.find('h1', class_ ='type-game')
print(stype)
This code returns "None". I have checked other articles on here and nothing has worked so far.
For the next level of my question, is there a way to create a For loop or similar to go through all of the pages (website is numbered sequentially for events) for any games that contain a string?
For example, if I wanted to only save games that have the Chicago Blackhawks in the h1 for the div element that has class= type-game?
Pseudocode would be something like this:
For webpages 1 to 10000:
if class_='type-game' 'h1' contains "Blackhawks"
then proceed with parsing the code
if not, skip the code and go to the next webpage
I know this is a little open ended, but I have a good VBA background and trying to apply those coding ideas to Python has been a challenge.
CodePudding user response:
Select your elements more specific for example with css selectors:
soup.select('h1:-soup-contains("Blackhawks")')
or
soup.select('div.type-game h1:-soup-contains("Blackhawks")')
To get the text from a tag just use .text or get_text()
for e in soup.select('h1:-soup-contains("Blackhawks")'):
print(e.text)
Example
html='''
<div >
<div >NHL Regular Season</div>
<h1>Blackhawks vs. Ducks</h1>
</div>
<div >
<div >NHL Regular Season</div>
<h1>Hawks vs. Ducks</h1>
</div>
<div >
<div >NHL Regular Season</div>
<h1>Ducks vs. Blackhawks</h1>
</div>
'''
soup = BeautifulSoup(html,'lxml')
for e in soup.select('h1:-soup-contains("Blackhawks")'):
print(e.text)
Output
Blackhawks vs. Ducks
Ducks vs. Blackhawks
EDIT
for e in soup.select('div.type-game h1'):
if 'Blackhawks' in e:
pint(e.text)#or do what ever is to do
