I've created a script using Scrapy that routes its requests through proxies to fetch content from a website. The script appears to be working correctly. The site I'm trying to grab data from is https://www.zillow.com/miami-fl-33166/.
Since this is an https site and I'm using https proxies, I've set up a proxy like the following:
request.meta['proxy'] = 'https://123.200.20.242:58847'
However, when I ran the script today after accidentally changing https to http, as shown below, I noticed that it still works.
request.meta['proxy'] = 'http://123.200.20.242:58847'
This is how I've implemented proxies within middleware:
def process_request(self, request, spider):
    request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    request.meta['proxy'] = 'https://123.200.20.242:58847'
    # request.meta['proxy'] = 'http://123.200.20.242:58847'
And this is how the middleware is referenced in the settings:
DOWNLOADER_MIDDLEWARES = {
    'customized_bot.proxy_middleware.ProxiesMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
What is the right way to set up https proxies within meta?
CodePudding user response:
Using an https proxy is no different from using an http proxy. You simply need to change the proxy address from http to https. See this article on zyte.com on how to use an https proxy. To summarize, you can:
- Pass the proxy via the meta object when making a scrapy.Request
- Set up a custom Scrapy middleware that sets the proxy on each scrapy.Request. More details are provided at zyte.com
To answer your question: http and https proxies can be used interchangeably to scrape both http and https URLs.
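The two options above can be combined in one small middleware. The following is only a minimal sketch, not the asker's actual code: PROXY_URL reuses the address from the question as a placeholder, and the SimpleNamespace object is a hypothetical stand-in for a scrapy.Request so the snippet runs without Scrapy installed. It relies on the fact that Scrapy's HttpProxyMiddleware reads request.meta['proxy'].

```python
from types import SimpleNamespace

# Placeholder proxy address copied from the question (assumption: swap in a
# proxy that is actually reachable). Either scheme can be tried here.
PROXY_URL = "http://123.200.20.242:58847"


class ProxiesMiddleware:
    """Downloader middleware sketch that routes requests through one proxy."""

    def __init__(self, proxy_url=PROXY_URL):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Keep a proxy that was already passed per request via meta;
        # otherwise fall back to the middleware-wide default.
        request.meta.setdefault("proxy", self.proxy_url)
        # Returning None tells Scrapy to continue processing this request.
        return None


# Stand-in for a scrapy.Request (hypothetical, for demonstration only).
fake_request = SimpleNamespace(
    url="https://www.zillow.com/miami-fl-33166/",
    meta={},
)
ProxiesMiddleware().process_request(fake_request, spider=None)
print(fake_request.meta["proxy"])
```

In a real spider, the per-request option would look like scrapy.Request(url, meta={'proxy': PROXY_URL}); because the sketch uses setdefault, such a request keeps its own proxy and the middleware default applies only to requests that did not set one.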
