parallel download of 7000 files

Time:02-02

Could you please advise on an effective method for downloading a large number of files from EBI: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/tree/master/tabix

We can use wget sequentially on each file. I have seen some information about using a Python script (How to parallelize file downloads?), although there might be complementary approaches using a bash script or R?

CodePudding user response:

If you don't need R here, the xargs command-line utility allows parallel execution. (I'm using the Linux version from the findutils set of utilities. I believe parallel execution is also supported by the version of xargs in git-bash. I don't know whether the macOS binary is installed by default or whether it includes this option; YMMV.)

For proof, I'll create a mywget script that prints the start time (and args) and then passes all arguments to wget.

(mywget)

#!/bin/sh
# Print the start time and the arguments, then hand everything to wget.
echo "$(date) :: $*"
wget "$@"

I also have a text file urllist with one URL per line (it's crafted so that I don't have to encode anything or worry about spaces, etc.). (Because I'm using a personal remote server to benchmark this, and I don't want the slashdot effect, I'll obfuscate the URLs here ...)

(urllist)

https://somedomain.com/quux0
https://somedomain.com/quux1
https://somedomain.com/quux2
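
If the URLs sit in a tab-separated metadata table rather than in a ready-made list, something like the following might produce a urllist. This is only a sketch: the helper name extract_urls and the file name paths.tsv are made up for illustration, and it assumes each URL occupies a whole field and starts with http://, https:// or ftp://.

```shell
#!/bin/sh
# Hypothetical helper: print every URL-looking field of a tab-separated
# file, one per line, without duplicates.
extract_urls() {
    # Put each field on its own line, drop Windows line endings,
    # keep only fields that start like a URL, and de-duplicate.
    tr '\t' '\n' < "$1" | tr -d '\r' | grep -E '^(https?|ftp)://' | sort -u
}

# Usage (paths.tsv is a placeholder name):
#   extract_urls paths.tsv > urllist
```

Note that sort -u reorders the list, which should not matter for downloading.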

First, no parallelization, simply consecutive (the default). (The -a urllist reads items from the file urllist instead of stdin. The -n1 gives each call to ./mywget a single URL, so every file gets its own timestamp. The -q is to be quiet; it is not required but certainly very helpful when doing things in parallel, since the default verbose output has progress bars that would overlap each other.)

$ time xargs -n1 -a urllist ./mywget -q
Tue Feb  1 17:27:01 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb  1 17:27:10 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb  1 17:27:12 EST 2022 :: -q https://somedomain.com/quux2

real    0m13.375s
user    0m0.210s
sys     0m0.958s

Second, adding -P 3 so that I run up to 3 simultaneous processes. The -n1 is required so that each call to ./mywget gets only one URL. You can adjust this if you want a single call to download multiple files consecutively.

$ time xargs -n1 -P3 -a urllist ./mywget -q
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux2

real    0m13.088s
user    0m0.272s
sys     0m1.664s

In this case, as BenBolker suggested in a comment, parallel download saved me nothing; it still took 13 seconds. However, you can see that in the first block the downloads started sequentially, with 9 seconds and 2 seconds between them. (We can infer that the first file is much larger, taking 9 seconds, and that the second file took about 2 seconds.) In the second block, all three started at the same time.
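
To check that -P really does run jobs side by side, independent of network speed, here is a small simulation. It is only a sketch: sleep stands in for wget, and the function name simulate is made up. Each fake "download" waits one second, so nothing competes for bandwidth.

```shell
#!/bin/sh
# Time three simulated 1-second "downloads" with a given number of
# parallel jobs. `sleep` stands in for wget, so no network is used.
simulate() {
    start=$(date +%s)
    # Three input items, one per sleep call, at most $1 calls at once.
    printf '%s\n' 1 1 1 | xargs -n1 -P"$1" sleep
    end=$(date +%s)
    echo $((end - start))
}

echo "sequential: $(simulate 1)s"
echo "parallel:   $(simulate 3)s"
```

The sequential run should take roughly three seconds and the parallel run roughly one. That suggests parallel downloads pay off mainly when per-file waiting dominates, whereas my three files above were presumably limited by bandwidth.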

(Side note: this doesn't require a shell script at all; you can use R's system or the processx::run functions to call xargs -n1 -P3 wget -q with a text file of URLs that you create in R. So you can still do this comfortably from the warmth of your R console.)
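
Putting the pieces together for a long list, a wrapper along these lines might be a starting point. It is a sketch, not a tuned solution: the function name download_parallel and the default of 4 jobs are my own choices, and I have not tried it against the EBI servers, so keep the job count modest.

```shell
#!/bin/sh
# Download every URL listed in a file, several wget processes at a time.
#   $1 = file with one URL per line
#   $2 = number of parallel jobs (default 4, an arbitrary choice)
download_parallel() {
    urllist=$1
    jobs=${2:-4}
    # Feeding the list on stdin instead of using -a keeps this usable
    # with xargs builds that lack the GNU-only -a option.
    # wget -q: quiet; -c: continue partially downloaded files on a re-run.
    xargs -n1 -P"$jobs" wget -q -c < "$urllist"
}

# Usage:
#   download_parallel urllist 8
```

xargs returns a non-zero exit status if any wget call fails, so the function's return value can be checked before re-running it on the same list.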

CodePudding user response:

I had a similar task, and my approach used Python, Redis and supervisord:

  1. I pushed all the paths/URLs of the files I needed to a Redis list (I just wrote a small Python script to read my CSV and push it to a Redis queue/list).
  2. Then I wrote another Python script to read (pull) one item from the Redis list and download it.
  3. Using supervisord, I launched 10 parallel copies of that Python script, which pulled data (file paths) from Redis and downloaded the files.

It might be too complicated for you, but this solution is very scalable, can use multiple servers, etc.
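
The same queue idea can be sketched in plain shell with redis-cli. This is not the Python and supervisord setup described above, only a rough stand-in: background shell jobs replace supervisord, the list name download_queue and the function names are made up, and it assumes a reachable Redis server and that redis-cli prints an empty line for an empty list when its output is captured.

```shell
#!/bin/sh
# Sketch of a Redis-backed download queue driven by redis-cli.
QUEUE=${QUEUE:-download_queue}   # hypothetical list name

# Step 1: push every URL from a file (one per line) onto the list.
enqueue() {
    xargs redis-cli LPUSH "$QUEUE" < "$1" > /dev/null
}

# Step 2: pop URLs until the list is empty, downloading each one.
# Assumption: an empty list yields an empty string here.
worker() {
    while url=$(redis-cli RPOP "$QUEUE") && [ -n "$url" ]; do
        wget -q "$url" || echo "failed: $url" >&2
    done
}

# Step 3: start N workers in the background and wait for all of them
# (background jobs stand in for supervisord in this sketch).
run_workers() {
    n=$1
    while [ "$n" -gt 0 ]; do
        worker &
        n=$((n - 1))
    done
    wait
}

# Usage:
#   enqueue urllist
#   run_workers 10
```

Because every worker pops from the same list, workers on other machines could share the queue as long as they can reach the Redis server.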
