I am trying to get all reviews of a movie from here: https://www.rottentomatoes.com/m/interstellar_2014/reviews. But as you can see, the web page only shows about 19 reviews, so I am unable to get all of them. My code below only prints the first 19 reviews.
## First we import the module necessary to open URLs (basically websites)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

def scrapUrl(URL):
    """ scrap data from url - give url as a parameter """
    page = urlopen(URL)
    html_bytes = page.read()
    html = html_bytes.decode("utf-8")
    #print(html)
    soup = BeautifulSoup(html, "html.parser")
    return soup

def findReviews(soup):
    """ find reviews: collect the text of every div whose class list contains 'the_review' """
    reviews = []
    for element in soup.find_all("div"):
        i = element.get("class")
        if i is not None:
            if 'the_review' in i:
                reviews.append(element.text)
    dfrev = pd.DataFrame(reviews, columns=['reviews'])
    return dfrev

url = "https://www.rottentomatoes.com/m/interstellar_2014/reviews"
sc = scrapUrl(url)  # pass the lowercase variable defined above; URL is undefined at this scope
t = findReviews(sc)
print(t)
CodePudding user response:
You can do this without BeautifulSoup, because Rotten Tomatoes retrieves the reviews from an API. First extract the movie ID from the source of the page at that URL with a regex, then send API requests in a loop until the last page is reached, and load the data with pandas:
import re
import time  # needed for time.sleep below

import pandas as pd
import requests

headers = {
    'Referer': 'https://www.rottentomatoes.com/m/notebook/reviews?type=user',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

s = requests.Session()

def get_reviews(url):
    # Fetch the reviews page and pull the movie ID out of its source
    r = requests.get(url)
    movie_id = re.findall(r'(?<=movieId":")(.*)(?=","type)', r.text)[0]
    api_url = f"https://www.rottentomatoes.com/napi/movie/{movie_id}/criticsReviews/all"  # use reviews/user for user reviews
    payload = {
        'direction': 'next',
        'endCursor': '',
        'startCursor': '',
    }
    review_data = []
    while True:
        r = s.get(api_url, headers=headers, params=payload)
        data = r.json()
        # Store this page before checking for a next one, so the last page is not dropped
        review_data.extend(data['reviews'])
        if not data['pageInfo']['hasNextPage']:
            break
        # Move the cursors forward to request the next page
        payload['endCursor'] = data['pageInfo']['endCursor']
        payload['startCursor'] = data['pageInfo']['startCursor'] if data['pageInfo'].get('startCursor') else ''
        time.sleep(1)  # pause between requests
    return review_data

data = get_reviews('https://www.rottentomatoes.com/m/interstellar_2014/reviews')
df = pd.json_normalize(data)
| | creationDate | isFresh | isRotten | isRtUrl | isTop | reviewUrl | quote | reviewId | scoreOri | scoreSentiment | critic.name | critic.criticPictureUrl | critic.vanity | publication.id | publication.name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Oct 9, 2021 | True | False | False | False | https://www.nerdophiles.com/2014/11/05/interstellar-delivers-beauty-and-complexity-in-typical-nolan-fashion/ | The inherent message of the film brings hope, but it can definitely get waterlogged by intellectual speak and long-winded scenes. | 2830324 | 3/5 | POSITIVE | Therese Lacson | http://resizing.flixster.com/gGcp41zlZQ3sYdSbQoS8AATHp8Y=/128x128/v1.YzszODg1O2o7MTg5OTA7MjA0ODszMDA7MzAw | therese-lacson | 3888 | Nerdophiles |
| 1 | Aug 10, 2021 | True | False | False | False | https://www.centraltrack.com/space-oddity/ | The film is indeed a sight to behold -- and one that demands to be seen on the biggest possible screen. | 2812665 | B | POSITIVE | Kip Mooney | http://resizing.flixster.com/hoYjdO_o-Ip21XnJaWr0C27-nbc=/128x128/v1.YzszOTk2O2o7MTg5OTA7MjA0ODs0MDA7NDAw | kip-mooney | 2577 | Central Track |
| 2 | Feb 2, 2021 | True | False | False | False | http://www.richardcrouse.ca/interstellar-3-stars-one-for-each-hour-of-the-movie-sentimental-sic-fi/ | Nolan reaches for the stars with beautifully composed shots and some mind-bending special effects, but the dime store philosophy of the story never achieves lift off. | 2763105 | 3/5 | POSITIVE | Richard Crouse | http://resizing.flixster.com/Ep5q7RwWq9Ud5KBhnha2sPnsRD0=/128x128/v1.YzszODgxO2o7MTg5OTA7MjA0ODszMDA7MzAw | richard-crouse | 3900 | Richard Crouse |
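
If you want to check the paging logic without hitting the site, here is a minimal offline sketch of the two steps described above: pulling the movie ID out of the page source and following the cursor until there are no more pages. Everything in it is an assumption made for illustration: `extract_movie_id`, `collect_all_pages`, `fake_fetch`, `FAKE_PAGES` and the sample HTML snippet are hypothetical, and the fake responses only mimic the `reviews` / `pageInfo` keys used in the code above, not the real API output. Each page is appended before `hasNextPage` is checked, so the final page is kept.

```python
import re
from typing import Callable, Dict, List


def extract_movie_id(html: str) -> str:
    """Return the movieId value found in the page source.

    [^"]+ stops at the first closing quote, so the match cannot run
    past the ID even if the pattern reappears later in the page.
    """
    match = re.search(r'"movieId":"([^"]+)"', html)
    if match is None:
        raise ValueError("movieId not found in page source")
    return match.group(1)


def collect_all_pages(fetch_page: Callable[[str], Dict]) -> List[Dict]:
    """Call fetch_page(cursor) repeatedly and gather the reviews of every page."""
    reviews: List[Dict] = []
    cursor = ""  # an empty cursor requests the first page
    while True:
        data = fetch_page(cursor)
        reviews.extend(data["reviews"])  # store first, so the last page is kept
        page_info = data["pageInfo"]
        if not page_info.get("hasNextPage"):
            break
        cursor = page_info["endCursor"]
    return reviews


# Hypothetical stand-in for the HTTP request: maps a cursor to a canned response.
FAKE_PAGES = {
    "": {
        "reviews": [{"quote": "a"}, {"quote": "b"}],
        "pageInfo": {"hasNextPage": True, "endCursor": "p2"},
    },
    "p2": {
        "reviews": [{"quote": "c"}],
        "pageInfo": {"hasNextPage": False},
    },
}


def fake_fetch(cursor: str) -> Dict:
    return FAKE_PAGES[cursor]


sample_html = '{"movieId":"abc-123","type":"movie"}'  # made-up snippet, not real page source
movie_id = extract_movie_id(sample_html)
all_reviews = collect_all_pages(fake_fetch)
print(movie_id, len(all_reviews))
```

To use it against the real endpoint, you could replace `fake_fetch` with a function that calls `s.get(api_url, headers=headers, params=...)` with the cursor in the payload and returns `r.json()`.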
