I am trying to scrape a news website for articles published on a certain date. The function's output looks like this:
<li ><time data-datetime="relative" datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">January 30, 2022 08:56</time></li>
How can I print only the date and time? Printing i.text doesn't seem to work.
Below is the code.
import requests
from bs4 import BeautifulSoup
import datetime as datetime
from datetime import timedelta
import pandas as pd

pd.set_option('display.max_columns',None)
pd.set_option('max_colwidth',None)

def okx_scrap():
    b = []
    url = 'https://www.okex.com/support/hc/en-us/sections/360000030652-Latest-Announcements'
    page = requests.get(url)
    soup = BeautifulSoup(page.content,'html.parser')
    small_soup = soup.find_all(class_ = "article-list-link")
    url_1st = 'https://www.okex.com/support'
    #Getting Yesterday's Date
    for i in small_soup:
        # Build the article URL from the base URL and the link's href
        full_url = url_1st + (i['href'])
        page2 = requests.get(full_url)
        soup2 = BeautifulSoup(page2.content,'html.parser')
        small_soup2 = soup2.find_all('li', {'class': 'meta-data'})
        #print(small_soup2)
        for i in small_soup2:
            print(i)

okx_scrap()
CodePudding user response:
Treat i as a string (if it is not one already, typecast it with the built-in str(), i.e. i = str(i)):
i = str(i)
i = i.split("><")[1]
i = i.split("datetime=")[2]
i = i.split("\"")[1]
print(i)
# 2022-01-30T08:56:09Z
CodePudding user response:
You can use a regular expression:
import re
string = '<li ><time data-datetime="relative" datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">January 30, 2022 08:56</time></li>'
# Named "pattern" so it does not shadow the datetime module imported in the question
pattern = r"(\d{1,4}-\d{1,2}-\d{1,2}T\d{1,2}:\d{1,2}:\d{1,2}Z)"
output = re.findall(pattern, string)
#output:
['2022-01-30T08:56:09Z', '2022-01-30T08:56:09Z']
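The pattern above matches twice because the same timestamp appears in both the datetime and title attributes. If a single value is wanted, one option is to target the datetime attribute itself. The following is only a sketch: the helper name extract_datetime and the constant DATETIME_ATTR are made up for illustration, and it assumes the attribute is always written with double quotes as in the question's sample.

```python
import re

html = '<li ><time data-datetime="relative" datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">January 30, 2022 08:56</time></li>'

# The leading \s requires whitespace before "datetime", so the
# data-datetime="relative" attribute (preceded by "-") is skipped.
DATETIME_ATTR = re.compile(r'\sdatetime="([^"]+)"')

def extract_datetime(markup):
    """Return the value of the datetime attribute, or None if it is absent."""
    match = DATETIME_ATTR.search(markup)
    return match.group(1) if match else None

timestamp = extract_datetime(html)
print(timestamp)
```

Returning None for markup without the attribute lets the caller decide how to handle pages that lack a time element.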
CodePudding user response:
Use find instead of find_all, because each page has only one entry, and extract the time element rather than the li:
def okx_scrap():
    b = []
    url = 'https://www.okex.com/support/hc/en-us/sections/360000030652-Latest-Announcements'
    page = requests.get(url)
    soup = BeautifulSoup(page.content,'html.parser')
    small_soup = soup.find_all(class_ = "article-list-link")
    url_1st = 'https://www.okex.com/support'
    #Getting Yesterday's Date
    for i in small_soup:
        # Build the article URL from the base URL and the link's href
        full_url = url_1st + (i['href'])
        page2 = requests.get(full_url)
        soup2 = BeautifulSoup(page2.content,'html.parser')
        # Read the datetime attribute of the page's <time> element
        print(soup2.find('time')['datetime'])
Output:
>>> okx_scrap()
2022-01-30T08:56:09Z
2022-01-29T05:41:18Z
2022-01-28T10:15:02Z
2022-01-28T07:29:11Z
2022-01-28T06:45:48Z
2022-01-28T03:13:18Z
...
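Since the original goal was to keep only the news from a certain date, the printed strings could be parsed and compared against a target date. This is a rough sketch rather than part of the answer above: parse_utc and published_on are hypothetical helper names, the list holds only the six timestamps visible in the output (the elided remainder is not included), and it assumes every value uses the same UTC format ending in "Z".

```python
from datetime import date, datetime, timezone

# The six timestamps visible in the output above.
timestamps = [
    "2022-01-30T08:56:09Z",
    "2022-01-29T05:41:18Z",
    "2022-01-28T10:15:02Z",
    "2022-01-28T07:29:11Z",
    "2022-01-28T06:45:48Z",
    "2022-01-28T03:13:18Z",
]

def parse_utc(ts):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def published_on(stamps, target):
    """Keep only the timestamps whose UTC calendar date equals target."""
    return [ts for ts in stamps if parse_utc(ts).date() == target]

matches = published_on(timestamps, date(2022, 1, 28))
print(matches)
```

For "yesterday", the target could be computed as date.today() - timedelta(days=1), keeping in mind that date.today() uses local time while the site's timestamps are in UTC.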
