Python webscraping Javascript with Await

Time:02-05

I have a problem with web scraping in Python. I'm trying to get the data from the first table on https://www.nyse.com/ipo-center/filings using AsyncHTMLSession from requests_html.

My code is here:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#then get the URL content, and load the html content after parsing through the javascript
r = await session.get(url)
await r.html.arender()

#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(r.html.html, "lxml")

#then we find the first datatable, which is the one that contains upcoming IPO data
table1 = soup.find('table', class_='table table-data table-condensed spacer-lg')

I have two problems with this:

  1. Often the website doesn't return any valid data for table1, so I don't get the contents of the table. So far I work around this by waiting a couple of seconds and running the loop again until the table has loaded, which is probably not the best option.
  2. The code works in Jupyter Notebook, but when I upload it as a .py file to my server, I get the error SyntaxError: 'await' outside async function.

Does anybody have a solution to the 2 problems mentioned above?

CodePudding user response:

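Since you are using coroutines, you need to wrap them inside an async function. In a plain .py file, await is only valid inside a function defined with async def. See the requests_html example below.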
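As a stripped-down illustration of that rule, here is a sketch that uses only the standard library. The fetch_page coroutine and the example.com URL are hypothetical stand-ins for the real network call, and asyncio.run plays the role that session.run plays in requests_html. Jupyter accepts a bare await because recent IPython versions run cells inside an event loop. A plain script has no such loop until you start one.

```python
import asyncio


async def fetch_page(url: str) -> str:
    # Hypothetical stand-in for an awaitable network call such as session.get(url).
    await asyncio.sleep(0)
    return f"<html>content of {url}</html>"


async def main() -> str:
    # 'await' is legal here because we are inside an 'async def' function.
    html = await fetch_page("https://example.com")
    return html


if __name__ == "__main__":
    # asyncio.run() starts an event loop, runs main() to completion, and closes the loop.
    print(asyncio.run(main()))
```

The answer's requests_html version of the same pattern follows.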

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#then get the URL content, and load the html content after parsing through the javascript
async def get_page():
    r = await session.get(url)
    await r.html.arender(timeout=20)
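    #return the rendered html; r.text would only hold the raw response from before the javascript ran
    return r.html.html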

#session.run() executes the coroutine and returns a list with one result per coroutine
data = session.run(get_page)

#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(data[0], "lxml")

#then we find all matching datatables; the first one in the list contains the upcoming IPO data
table1 = soup.find_all('table', class_='table table-data table-condensed spacer-lg')
print(table1)
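
For the first problem (the table is sometimes missing from the response), one option is to keep the wait-and-retry approach but cap the number of attempts. The sketch below is only an illustration of that idea. The helpers fetch_until_found and make_flaky_fetch are hypothetical names, not part of requests_html, and the flaky fetch simulates a page whose table appears on the third try. In real use, fetch would be a coroutine that renders the page and returns the table, or None if it is not there yet.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def fetch_until_found(
    fetch: Callable[[], Awaitable[Optional[str]]],
    attempts: int = 5,
    delay: float = 2.0,
) -> Optional[str]:
    """Call fetch() until it returns a non-empty result or the attempts run out."""
    for attempt in range(attempts):
        result = await fetch()
        if result:
            return result
        # Pause before the next try, but not after the final one.
        if attempt < attempts - 1:
            await asyncio.sleep(delay)
    return None


def make_flaky_fetch(fail_times: int):
    """Build a simulated fetch that returns None for the first fail_times calls."""
    calls = {"n": 0}

    async def fetch() -> Optional[str]:
        calls["n"] += 1
        return None if calls["n"] <= fail_times else "<table>...</table>"

    return fetch, calls


if __name__ == "__main__":
    fetch, calls = make_flaky_fetch(2)
    table = asyncio.run(fetch_until_found(fetch, attempts=5, delay=0.01))
    print(table, "after", calls["n"], "calls")
```

It may also be worth trying the sleep argument of arender, which waits after the initial render and gives the page's javascript more time to fill the table before the html is read.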