I’m making a distributed web crawler with Python and Celery.
I have a master script that gets all URLs from the queue in the database, then sends them as tasks to Celery workers. When a worker has crawled a page, it sends its results back to the queue. The master script then loops and repeats the process.
I’m having difficulty keeping track of a “master list”: a list that contains all crawled URLs, so that no URL is crawled twice.
Initially, I tried a table for all the crawled links plus a list inside the master. The master would check that the URL from the queue isn’t in the list, then proceed. This works pretty well until around 10k URLs have been crawled. After that, querying and reading the database becomes slow. Almost all of the time is spent waiting on database reads and writes instead of crawling.
I’m using PostgreSQL. I’m wondering if I have it poorly configured (I’m using it straight out of installation, with no settings changed).
I’ve also tried using a UNIQUE constraint on the URL field in the database, to skip the need to check whether a URL has been crawled. However, with so many concurrent crawlers running, it constantly hits “IntegrityError” exceptions.
I’m sure there’s a really obvious solution to this problem and that I’ve been overthinking it. Or a better way to structure the database so it’s not slow.
CodePudding user response:
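One possible approach, offered as a sketch rather than a definitive fix: keep the UNIQUE constraint, but have the insert skip duplicates instead of raising, and use the affected row count to decide whether the URL is new. PostgreSQL (9.5 and later) supports `INSERT ... ON CONFLICT DO NOTHING` for this. The example below uses Python's built-in `sqlite3` module as a stand-in so it runs without a server; the table name `crawled` and the helpers `make_store` and `claim_url` are made up for illustration.

```python
import sqlite3


def make_store():
    """Create an in-memory table of crawled URLs (stand-in for PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    # PRIMARY KEY gives the uniqueness guarantee and an index for lookups.
    conn.execute("CREATE TABLE crawled (url TEXT PRIMARY KEY)")
    return conn


def claim_url(conn, url):
    """Try to record a URL. Return True if it was new, False if already seen."""
    cur = conn.execute(
        # A duplicate is silently skipped instead of raising IntegrityError.
        "INSERT INTO crawled (url) VALUES (?) ON CONFLICT (url) DO NOTHING",
        (url,),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the conflict was skipped.
    return cur.rowcount == 1


conn = make_store()
first = claim_url(conn, "https://example.com/")   # new URL
second = claim_url(conn, "https://example.com/")  # duplicate
total = conn.execute("SELECT COUNT(*) FROM crawled").fetchone()[0]
```

With this pattern the check and the write happen in one statement, so two workers racing on the same URL cannot both see it as new. With psycopg2 against PostgreSQL the SQL would be the same except that the placeholder is `%s` instead of `?`.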
