How to scrape dates from a news site

Time:02-06

I am trying to scrape a news website for articles published on a certain date. The function currently prints:

<li ><time data-datetime="relative" datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">January 30, 2022 08:56</time></li>

How can I print only the date and time? Printing i.text doesn't seem to work.

Below is the code.

import requests
from bs4 import BeautifulSoup
import datetime as datetime
from datetime import timedelta
import pandas as pd
pd.set_option('display.max_columns',None)
pd.set_option('max_colwidth',None)

def okx_scrap():

    b = []
    url = 'https://www.okex.com/support/hc/en-us/sections/360000030652-Latest-Announcements'
    page = requests.get(url)
    soup = BeautifulSoup(page.content,'html.parser')
    small_soup = soup.find_all(class_ = "article-list-link")
    url_1st = 'https://www.okex.com/support'

    # Getting yesterday's date

    for i in small_soup:
        # Build the absolute article URL from the relative href
        full_url = url_1st + i['href']
        page2 = requests.get(full_url)
        soup2 = BeautifulSoup(page2.content,'html.parser')
        small_soup2 = soup2.find_all('li', {'class': 'meta-data'})
        #print(small_soup2)
        for i in small_soup2:
            print(i)

okx_scrap()

CodePudding user response:

Treat i as a string (if it is not one already, typecast it with the built-in str(), i.e. i = str(i)):

i = str(i)
i = i.split("><")[1]
i = i.split("datetime=")[2]
i = i.split("\"")[1]

print(i)
# 2022-01-30T08:56:09Z


CodePudding user response:

You can use a regex:

import re

string = '<li ><time data-datetime="relative" datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">January 30, 2022 08:56</time></li>'

# Named "pattern" so it does not shadow the datetime module
pattern = r"(\d{1,4}-\d{1,2}-\d{1,2}T\d{1,2}:\d{1,2}:\d{1,2}Z)"

output = re.findall(pattern, string)

#output:

['2022-01-30T08:56:09Z', '2022-01-30T08:56:09Z']
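
The match appears twice because the same timestamp sits in both the datetime and title attributes. As a minimal sketch (not part of the original answer), the pattern can be anchored to the datetime attribute alone and the result converted to a datetime object. The variable names html, stamp, and parsed are illustrative, and the format string assumes the timestamp always has the shape shown in the question.

```python
import re
from datetime import datetime, timezone

html = '<li ><time data-datetime="relative" datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">January 30, 2022 08:56</time></li>'

# Require whitespace before "datetime=" so that data-datetime="relative" is skipped
match = re.search(r'\sdatetime="([^"]+)"', html)
stamp = match.group(1) if match else None

# Assumption: timestamps are always UTC in the form YYYY-MM-DDTHH:MM:SSZ
parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

print(stamp)
print(parsed.date())
```

Once the value is a datetime, comparing it against a target date is a plain equality check on parsed.date().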

CodePudding user response:

Use find rather than find_all, because there is only one entry on each page, and extract the time element rather than the li:

def okx_scrap():

    b = []
    url = 'https://www.okex.com/support/hc/en-us/sections/360000030652-Latest-Announcements'
    page = requests.get(url)
    soup = BeautifulSoup(page.content,'html.parser')
    small_soup = soup.find_all(class_ = "article-list-link")
    url_1st = 'https://www.okex.com/support'

    # Getting yesterday's date

    for i in small_soup:
        # Build the absolute article URL from the relative href
        full_url = url_1st + i['href']
        page2 = requests.get(full_url)
        soup2 = BeautifulSoup(page2.content,'html.parser')
        print(soup2.find('time')['datetime'])

Output:

>>> okx_scrap()
2022-01-30T08:56:09Z
2022-01-29T05:41:18Z
2022-01-28T10:15:02Z
2022-01-28T07:29:11Z
2022-01-28T06:45:48Z
2022-01-28T03:13:18Z
...
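
Since the question asks for news from a certain date, the timestamps printed above still need to be filtered. The following is a rough sketch under stated assumptions, not the answerer's code: parse_stamp and stamps_on are hypothetical helper names, and the sample list is copied from the output shown above.

```python
from datetime import date, datetime, timedelta, timezone

def parse_stamp(stamp):
    # Assumption: every timestamp is UTC in the form YYYY-MM-DDTHH:MM:SSZ
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def stamps_on(stamps, target):
    # Keep only the timestamps whose calendar date (in UTC) equals target
    return [s for s in stamps if parse_stamp(s).date() == target]

stamps = [
    "2022-01-30T08:56:09Z",
    "2022-01-29T05:41:18Z",
    "2022-01-28T10:15:02Z",
    "2022-01-28T07:29:11Z",
    "2022-01-28T06:45:48Z",
    "2022-01-28T03:13:18Z",
]

# Fixed date so the example is reproducible
matches = stamps_on(stamps, date(2022, 1, 28))
print(matches)

# For "yesterday" in live use, the target could be computed like this:
yesterday = datetime.now(timezone.utc).date() - timedelta(days=1)
```

Inside okx_scrap, the same check could be applied to soup2.find('time')['datetime'] before printing or collecting each article.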