I'm trying to scrape a website that returns HTTP 403 if JavaScript is not enabled.
My approach is to have the Selenium driver, inside the parse method, take the URL from response.request.url and fetch the page.
The problem is that the request is closed automatically after the HTTP 403 response, so the spider never enters the parse method.
Here is my code:
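import scrapy
from selenium import webdriver

class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    start_urls = ["https://website_that_returning_403.com"]

    def parse(self, response):
        # Re-fetch the same URL in a real browser so JavaScript runs.
        bot = webdriver.Chrome()
        bot.get(response.request.url)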
CodePudding user response:
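By default, Scrapy's HttpErrorMiddleware filters out responses whose status codes fall outside the 200-300 range, so they never reach the spider callback. To handle such a status yourself, list it in the handle_httpstatus_list spider attribute, as below: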
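import scrapy
from selenium import webdriver

class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    # Let 403 responses through to parse() instead of dropping them.
    handle_httpstatus_list = [403]
    start_urls = ["https://website_that_returning_403.com"]

    def parse(self, response):
        # Re-fetch the same URL in a real browser so JavaScript runs.
        bot = webdriver.Chrome()
        bot.get(response.request.url)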
Read more about it in the Scrapy documentation for HttpErrorMiddleware.
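To make the effect of handle_httpstatus_list concrete, here is a minimal sketch of the filtering rule in plain Python. It is a simplified model for illustration only, not Scrapy's actual source: the function name reaches_callback and its parameters are hypothetical, and it ignores the per-request meta keys and the handle_httpstatus_all option that Scrapy also supports.

```python
def reaches_callback(status, allowed_statuses=()):
    """Simplified model: does a response with this status reach parse()?

    `allowed_statuses` stands in for the spider's handle_httpstatus_list
    (hypothetical parameter name, used here for illustration only).
    """
    # Responses in the 2xx range are always passed to the callback.
    if 200 <= status < 300:
        return True
    # Anything else is passed on only if it was explicitly allowed.
    return status in allowed_statuses


if __name__ == "__main__":
    # Without the attribute, the 403 is filtered out before parse().
    print(reaches_callback(403))
    # With handle_httpstatus_list = [403], the 403 reaches parse().
    print(reaches_callback(403, [403]))
```

Under this model, the spider in the question corresponds to the first call and the spider in the answer to the second, which is why only the latter ever enters parse.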
