I am unable to scrape domain name from this website? Postman returns json() but requests through exc

Time:02-02

I want to scrape the domain name, social links (LinkedIn, Twitter), and emails from the following website: https://cloud28plus.com/en/partner/resecurity--inc- I first tried to fetch the data from the network requests, but that did not work. Then I tried the requests module, which throws an exception when I do this:

response = requests.get(url)
data = response.json() # not working.

Then I tried BeautifulSoup. When I print soup.body, it returns data, but the data is not structured, so soup.find_all('a') returns an empty list []. My code is:

import requests
from bs4 import BeautifulSoup
url = 'https://cloud28plus.com/en/partner/resecurity--inc-'
response = requests.get(url)
# data = response.json() # not working
page = response.text
soup = BeautifulSoup(page, 'html.parser')
# Returns Empty list
soup.find_all('a')

soup.find('a', class_ = 'followUs__IconTwitter-sc-1gwf1fm-2 edzSJr fa fa-twitter-square')  # returns nothing
soup.find_all('div', class_ = 'col')  # empty list

Can anybody tell me what I am doing wrong?
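
On the response.json() failure specifically: a likely cause is that this URL serves an HTML page rather than a JSON document, so there is nothing for .json() to decode. Below is a minimal sketch of a check on the Content-Type header value before decoding. The helper name is_json_content_type is hypothetical and not part of requests, and the header strings are only typical examples.

```python
def is_json_content_type(content_type):
    """Return True if a Content-Type header value describes a JSON body."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


# Header values of the kind a JSON API and an HTML page typically send.
print(is_json_content_type("application/json; charset=utf-8"))
print(is_json_content_type("text/html; charset=utf-8"))
```

With requests, this could be called as is_json_content_type(response.headers.get("Content-Type", "")), with response.json() used only when it returns True.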

CodePudding user response:

The data you see on the page is stored as JSON embedded in the HTML, inside the element with id __NEXT_DATA__. To parse it, you can use the following example:

import json
import requests
from bs4 import BeautifulSoup

url = "https://cloud28plus.com/en/partner/resecurity--inc-"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

print(data["props"]["initialProps"]["pageProps"]["element"]["twitter"])

Prints:

https://twitter.com/RESecurity
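
The other fields can probably be read from the same data dictionary, but indexing a key that is absent raises KeyError. Here is a minimal sketch of a safe nested lookup. The helper name dig, the key names other than "twitter" (linkedin, website, email) and the sample values are assumptions for illustration only. Check the real key names by printing the full data with json.dumps as in the example above.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts by key; return default if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Stand-in for the parsed __NEXT_DATA__ JSON. Only the nesting path and the
# "twitter" key come from the answer above; the "linkedin" entry is assumed.
sample = {
    "props": {
        "initialProps": {
            "pageProps": {
                "element": {
                    "twitter": "https://twitter.com/RESecurity",
                    "linkedin": "https://example.com/linkedin-profile",
                }
            }
        }
    }
}

path = ("props", "initialProps", "pageProps", "element")
for field in ("twitter", "linkedin", "website", "email"):
    # Missing fields come back as None instead of raising KeyError.
    print(field, "->", dig(sample, *path, field))
```

Passing the real data dictionary in place of sample gives the same behaviour, so fields that a given partner page lacks do not stop the scrape.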