How can I crawl a bunch of links on a root website using Scrapy?

Time:01-11

I am trying to crawl a COVID-19 statistics website that has a bunch of links to pages with statistics for different countries. The links all share a class name ('mt_a'), which makes them easy to select with CSS selectors. The country pages do not link to one another, so from one country's page there is no link to the next country. I am a complete beginner with Scrapy, and I'm not sure how to scrape the same few pieces of information from each of the roughly 200 links listed on the root page. Any guidance on what I should be trying to do would be appreciated.

The link I'm trying to scrape: https://www.worldometers.info/coronavirus/ (scroll down to see country links)

CodePudding user response:

I think others have already answered the question, but here is the page for Link extractors.

CodePudding user response:

What I would do is create two spiders. The first would parse the home page and extract the href of every anchor tag that points to a country page, e.g. href="country/us/", then turn these relative links into full URLs such as https://www.worldometers.info/coronavirus/country/us/.
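The link-resolution step can be sketched with the Python standard library alone, so it runs without Scrapy. The sample markup, the CountryLinkParser class, and the country_urls helper are assumptions made up for illustration; only the 'mt_a' class and the URLs come from the question. Inside a Scrapy callback, response.urljoin(href) performs the same relative-to-absolute resolution.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

ROOT_URL = "https://www.worldometers.info/coronavirus/"


class CountryLinkParser(HTMLParser):
    """Collects the href of every <a> tag carrying the 'mt_a' class."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if "mt_a" in classes and href:
            self.hrefs.append(href)


def country_urls(html, base_url=ROOT_URL):
    """Return absolute country URLs in page order, without duplicates."""
    parser = CountryLinkParser()
    parser.feed(html)
    absolute = [urljoin(base_url, href) for href in parser.hrefs]
    # dict.fromkeys keeps the first occurrence of each URL, in order.
    return list(dict.fromkeys(absolute))


# Hypothetical markup imitating the country links on the root page.
sample_html = """
<a class="mt_a" href="country/us/">USA</a>
<a class="mt_a" href="country/russia/">Russia</a>
<a class="mt_a" href="country/us/">USA</a>
<a href="/about/">About</a>
"""

urls = country_urls(sample_html)
print(urls)
```

The de-duplication is included on the assumption that the same country may be linked more than once on the root page; drop it if that turns out not to be the case.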

The second spider is then given the list of all country URLs, crawls each individual page, and extracts the information from it.

For example, you get a list of urls from the first spider:

urls = ['https://www.worldometers.info/coronavirus/country/us/',
'https://www.worldometers.info/coronavirus/country/russia/']

Then in the second spider you give that list to the start_urls attribute.
