Home > Mobile >  Why do I scrape more reviews than the indeed website shown
Why do I scrape more reviews than the indeed website shown

Time:01-28

I have a question about scraping employee reviews from Indeed for Meta. I am just looking at reviews from employees in the United States, and the indeed website shows that there are 467 reviews from employees in the US as the figure shown below, but I have obtained 490 reviews from the web scrape. Why are reviews from web-scraping more than those shown on the website? How can I address this issue?

This is the link:

enter image description here

# Load the Modules
from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np
import pandas as pd

lst=[]
for i in range(0, 480, 20):
    print(i)
    url = (f'https://www.indeed.com/cmp/Meta-dd1502f2/reviews?start={i}')
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}
    page = requests.get(url, headers = header)
    soup = BeautifulSoup(page.content, 'lxml')
    main_data = soup.find_all("div",attrs={"data-tn-section":"reviews"})
    for data in main_data:
        try:
            date=data.find("span",attrs={"itemprop":"author"}).get_text(strip=True).split("-")[2]
        except AttributeError:
            date=np.nan
        try:
            title=data.find("h2").get_text(strip=True)
        except AttributeError:
            title=np.nan
        try:
            status=data.find("span",attrs={"itemprop":"author"}).get_text(strip=True).split("-")[0]
        except AttributeError:
            status=np.nan
            
        try:
            location=data.find("span",attrs={"itemprop":"author"}).get_text(strip=True).split("-")[1]
        except AttributeError:
            location=np.nan
        try:
            review=data.find("span",attrs={"itemprop":"reviewBody"}).get_text(strip=True)
        except AttributeError:
            review=np.nan
            
        try:
            pros=data.find('h2',class_='css-6pbru9 e1tiznh50').next_sibling.get_text(strip=True)           
        except:
            pros=np.nan
        try:
            cons=data.find('h2',class_='css-cvf89l e1tiznh50').next_sibling.get_text(strip=True)
        except:
            cons=np.nan
            
        try:
            rating=data.find("div",attrs={"itemprop":"reviewRating"}).find("button")['aria-label'].split(" ")[0]
        except AttributeError:
            rating=np.nan
        lst.append([date, title, status, location, review, pros, cons, rating])

df_meta=pd.DataFrame(data=lst,columns=['date', 'title', 'status', 'location', 'review', 'pros', 'cons', 'rating'])
df_meta

enter image description here

CodePudding user response:

Indeed returns 21 results on each page. Your paging is assuming that there are 20 results on page.

However, the number of pages (24) seems to indicate there are 20 results on each page.

20 * 23   7 (on the last page) = 467

What actually happens is that each page returns 21 results:

21 * 23   7 = 490

which gives your number.

It looks like Indeed has a bug that returns incorrect number of reviews when you order by date. I did not see the same problem when ordering is based on rating.

  •  Tags:  
  • Related