How to dynamically change download folder in scrapy?

Time:02-05

I am downloading some HTML files from a website using Scrapy, but all the downloads are stored in one folder. I would rather store them in different folders dynamically, e.g. HTML files from page 1 go into folder_1, and so on.

This is what my spider looks like:

import scrapy

class LearnSpider(scrapy.Spider):
    name = "learn"
    
    start_urls = ["someUrlWithIndexstart=" + chr(i) for i in range(ord('a'), ord('z') + 1)]

    def parse(self, response):
        for song in response.css('.entity-title'):
            songs = song.css('a ::attr(href)').get()
            yield {
                'file_urls': [songs + ".html"]
            }

Ideally, the HTML files scraped for each letter would go into a subfolder for that letter.

The following is my settings file:

BOT_NAME = 'learn'

SPIDER_MODULES = ['learn.spiders']
NEWSPIDER_MODULE = 'learn.spiders'

ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloaded_files'

Any solution/idea will be helpful, thank you.

CodePudding user response:

Create a pipeline:

pipelines.py:

import os
from itemadapter import ItemAdapter
from urllib.parse import unquote
from scrapy.pipelines.files import FilesPipeline
from scrapy.http import Request


class ProcessPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        return [Request(u) for u in urls]

    def file_path(self, request, response=None, info=None, *, item=None):
        file_name = os.path.basename(unquote(request.url))
        return item['path'] + file_name
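
The path logic in file_path can be tried out on its own. Below is a minimal sketch, not part of the original answer: build_file_path is a hypothetical helper name, and the example URL is made up. It assumes you want only the URL's path component in the file name (so a query string is dropped) and that the folder may or may not end with a slash.

```python
import posixpath
from urllib.parse import unquote, urlparse


def build_file_path(url: str, folder: str) -> str:
    """Return '<folder>/<file name>' relative to FILES_STORE (hypothetical helper)."""
    # Use only the URL path so a query string does not end up in the file name.
    file_name = posixpath.basename(unquote(urlparse(url).path))
    # posixpath.join adds the separator if the folder lacks a trailing slash.
    return posixpath.join(folder, file_name)


if __name__ == "__main__":
    # Made-up URL, for illustration only.
    print(build_file_path("https://example.com/songs/My%20Song.html?x=1", "folder_a"))
```

Inside the pipeline, file_path could then return build_file_path(request.url, item['path']) instead of concatenating the strings directly.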

Change ITEM_PIPELINES in the settings to point to this class, replacing projectsname with the name of your project (ITEM_PIPELINES = {'projectsname.pipelines.ProcessPipeline': 1})

When you yield the item, also add the path of the directory you want to download to:

yield {
    'file_urls': [songs + ".html"],
    'path': f'folder{page}/'   # of course, you'll need to provide the page variable
}
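
One possible way to supply that variable for the spider in the question is to read the letter back from the index URL, since each start URL ends with "=" followed by the letter. The sketch below is an assumption-laden illustration, not the answerer's code: letter_from_index_url and make_item are hypothetical helper names, and it assumes response.url still ends with the letter (i.e. the request was not redirected).

```python
def letter_from_index_url(url: str) -> str:
    """Return whatever follows the last '=' in the index URL (the letter)."""
    return url.rsplit("=", 1)[-1]


def make_item(song_href: str, index_url: str) -> dict:
    """Build the item the spider would yield for one song link."""
    letter = letter_from_index_url(index_url)
    return {
        "file_urls": [song_href + ".html"],
        # Folder name is relative to FILES_STORE; the pipeline prepends it.
        "path": f"folder_{letter}/",
    }


if __name__ == "__main__":
    # "someUrlWithIndexstart=" is the placeholder from the question.
    print(make_item("https://example.com/song1", "someUrlWithIndexstart=a"))
```

In parse, the loop body would then become yield make_item(songs, response.url). If the hrefs on the page are relative, they would likely need response.urljoin(songs) first, because the files pipeline requests each entry in file_urls as given.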