How to get date from inside of span - while not available in beautifulsoup object?

Time:01-04

I have been scraping the https://www.oddsportal.com/ website using bs4 and requests.

To scrape the dates of the matches, I used the following code:

dates_list = soup_league.tbody.find_all('th',{'class': 'first2 tl'})

This returns a list like:

[<th  colspan="3"><span ></span></th>,
 <th  colspan="3"><span ></span></th>,
 <th  colspan="3"><span ></span></th>,
 <th  colspan="3"><span ></span></th>,
 <th  colspan="3"><span ></span></th>,
 <th  colspan="3"><span ></span></th>,
 <th  colspan="3"><span ></span></th>]

Here the date is no longer present in the elements. But the markup shown in the browser contains the date inside the span tag:

<span >07 Jan 2022</span>

Why does this happen? Is there a solution?

Answer:

Note: Always look in your soup first - therein lies the truth. What requests receives can differ slightly or drastically from what the browser's developer tools show.

What happens?

The content is not served statically; it is filled in dynamically by JavaScript. requests does not render JavaScript, so the dates never appear in the response it receives.
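A quick offline way to see the difference is to parse a fragment like the one requests would receive and print what each span actually holds. The sketch below uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages. The STATIC_HTML string and the timestamp-style class token in it are assumptions made up for illustration, not markup copied from the site.

```python
from html.parser import HTMLParser

# Hypothetical fragment resembling the static response: the span has a class
# attribute but no text, because a script would fill the text in later.
STATIC_HTML = (
    '<table><tbody><tr><th colspan="3">'
    '<span class="datet t1641859200-1-1-0-0"></span>'
    '</th></tr></tbody></table>'
)


class SpanInspector(HTMLParser):
    """Collect [class attribute, text] for every <span> in the markup."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._in_span = True
            self.spans.append([dict(attrs).get("class", ""), ""])

    def handle_data(self, data):
        # Only text that sits inside a <span> is recorded.
        if self._in_span:
            self.spans[-1][1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False


inspector = SpanInspector()
inspector.feed(STATIC_HTML)
for css_class, text in inspector.spans:
    print(repr(css_class), repr(text))
```

If the text comes back empty while the class attribute is populated, the visible date is being written by a script after the page loads.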

How to fix?

  1. The best way would be to look for an alternative source or use an API.

  2. Use selenium to grab the rendered HTML and process it with BeautifulSoup.

  3. If this is the only part you need, grab the timestamp from the span's class attribute and convert it with datetime.

Example

import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone

url = 'https://www.oddsportal.com/soccer/england/premier-league/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Each date span carries a Unix timestamp in its last class token.
for d in soup.select('#tournamentTable span.datet'):
    # Keep the part before the first '-' and drop its one-letter prefix.
    ts = int(d['class'][-1].split('-')[0][1:])
    # Timezone-aware replacement for the deprecated utcfromtimestamp().
    print(datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%d %b %Y'))

Output

11 Jan 2022
12 Jan 2022
14 Jan 2022
15 Jan 2022
16 Jan 2022
21 Jan 2022
22 Jan 2022
23 Jan 2022
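
The timestamp extraction in the loop above can also be pulled out into a small helper and checked without hitting the site. This is only a sketch: date_from_class_token is a hypothetical name, and the sample token assumes the class looks like a one-letter prefix followed by Unix seconds and dash-separated extras, which is what the slicing in the example implies.

```python
from datetime import datetime, timezone


def date_from_class_token(token: str) -> str:
    """Convert a class token such as 't1641859200-1-1-0-0' to 'DD Mon YYYY'."""
    # Everything before the first '-' is a one-letter prefix plus Unix seconds.
    seconds = int(token.split("-")[0][1:])
    # Interpret the seconds as UTC so the result does not depend on local time.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%d %b %Y")


# Assumed sample token; 1641859200 is midnight UTC on 11 Jan 2022.
print(date_from_class_token("t1641859200-1-1-0-0"))
```

Keeping the parsing in one function makes it easier to adjust if the site changes the class format.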