What is the correct way to set up an https proxy within meta?

I've created a script using scrapy that fetches content from a website through proxies. The script appears to be working correctly. The site I'm trying to grab data from is https://www.zillow.com/miami-fl-33166/.

Since this is an https site and I'm using https proxies, I've set up a proxy like the following:

request.meta['proxy'] = 'https://123.200.20.242:58847'

However, when I ran the script today after accidentally changing https to http, as in the following line, I noticed that it still works.

request.meta['proxy'] = 'http://123.200.20.242:58847'

This is how I've implemented proxies within the middleware:

class ProxiesMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
        # Route the request through the proxy
        request.meta['proxy'] = 'https://123.200.20.242:58847'
        # request.meta['proxy'] = 'http://123.200.20.242:58847'

And this is how the middleware is registered in the settings:

DOWNLOADER_MIDDLEWARES = {
    'customized_bot.proxy_middleware.ProxiesMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

What is the right way to set up https proxies within meta?

Answer:

Using an https proxy is no different from using an http proxy: you only need to change the scheme of the proxy address from http to https. See the article on zyte.com on how to use an https proxy. To summarize, you can:

Pass the proxy via the meta object when making a scrapy.Request.
Set up a custom scrapy middleware that sets the proxy on each scrapy Request. More details are provided at zyte.com.

To answer your question: the scheme in the proxy address does not have to match the scheme of the target URL, so http and https proxies can be used interchangeably to scrape http and https URLs.