I'm trying to scrape a list of zpids from this webpage using the requests module. The zpids appear in a list right next to searchListZpids in the page source (Ctrl+U). There are 40 of them.
The script below fetches the zpids without errors. The problem is that the list it produces differs from the one on the webpage: only some of the zpids I receive match those on the page.
Sometimes the list I get is accurate, but most of the time it is different.
The script that I'm using:
import re
import requests
link = 'https://www.zillow.com/ct/9_p/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
res = requests.get(link, headers=headers)
# Skip lazily to the first "[" after searchListZpids and capture the list body
zpids = re.findall(r"searchListZpids[\s\S]+?\[(.*?)\]", res.text)[0]
print(zpids)
Output I get at this moment:
57912175, 177202011, 57838346, 57702376, 2083150985, 2091636205, 59028017, 2066602375, 57843835, 2066598335, 58845027, 58904562, 58118011, 58838731, 57930222, 2066611590, 59977275, 197747278, 57932219, 57893209, 58775017, 2066600444, 2066601022, 58059157, 177275234, 58819070, 59297439, 58859881, 2078457589, 58775318, 57790587, 57689409, 2066601997, 57394605, 177286302, 58133143, 59068957, 58096934, 240506947, 83121293
How can I scrape the exact list of zpids from that webpage using requests?
CodePudding user response:
I have run your code several times and have not found a single mismatch.
t.py file
import re
import requests
link = 'https://www.zillow.com/ct/9_p/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
res = requests.get(link, headers=headers)
# Skip lazily to the first "[" after searchListZpids and capture the list body
zpids = re.findall(r"searchListZpids[\s\S]+?\[(.*?)\]", res.text)[0]
print(zpids)
with open("html.txt", "w") as f:
    f.write(res.text)
    f.write("\n")
in terminal
date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
output
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:11 EST 2022
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:13 EST 2022
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:15 EST 2022
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:17 EST 2022
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:19 EST 2022
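The same consistency check can be sketched in Python alone, without perl and diff. This is only a minimal sketch: the helper name extract_zpids and the sample fragment are assumptions for illustration, with the fragment shaped like the text the perl one-liner matches rather than copied from the real page.

```python
import re

# Same idea as the script above: skip lazily to the first "[" after the key.
ZPID_PATTERN = re.compile(r"searchListZpids[\s\S]+?\[(.*?)\]")


def extract_zpids(html):
    """Return the zpids found next to searchListZpids as a list of ints."""
    match = ZPID_PATTERN.search(html)
    if match is None:
        return []
    # int() tolerates the surrounding spaces left after splitting on commas.
    return [int(tok) for tok in match.group(1).split(",") if tok.strip()]


# Hypothetical fragment, not taken from the real page source.
sample = 'x "searchListZpids","[57912175, 177202011, 57838346]" y'
print(extract_zpids(sample))
```

Calling extract_zpids on res.text and again on the contents of the saved html.txt should give equal lists if the script reads the page correctly.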
CodePudding user response:
You are doing everything right; you are mistaken in thinking that the list of zpids you get is incorrect.
The zpids are the agent listings displayed on the current page (in your case the 9th page, because your URL uses the 9_p route).
Your request matches more than 5000 agent listings, and you do not specify an order for them, so they can differ from request to request (you should see this in your browser too).
You can try setting a sort order in your request. For example, this URL shows agent listings sorted by price from low to high. That is not a complete solution either, because the full set of listings on the source website can change at any time.
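To see how much two responses actually differ, the lists can be compared as sets. The sketch below is illustrative only: compare_zpids is a hypothetical helper, and the second list is invented for the example by reusing ids from the question's output, not taken from a real response.

```python
def compare_zpids(first, second):
    """Split two zpid lists into shared ids and ids unique to each list."""
    first_set, second_set = set(first), set(second)
    return {
        "common": sorted(first_set & second_set),
        "only_first": sorted(first_set - second_set),
        "only_second": sorted(second_set - first_set),
    }


# first: a few ids from the question's output; second: a hypothetical later response.
first = [57912175, 177202011, 57838346]
second = [57838346, 57912175, 58845027]
result = compare_zpids(first, second)
print(result)
```

A large "common" part with a few ids unique to each side would be consistent with listings shifting between pages when no sort order is fixed.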
