Parallel download using Curl command line utility
Solution 1
Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel as you like, each sending its output to a different file.
curl can use the filename part of the URL to generate the local file name. Just use the -O option (see man curl for details).
You could use something like the following:
urls="http://example.com/?page1.html http://example.com/?page2.html" # add more URLs here
for url in $urls; do
    # run the curl job in the background so we can start another job
    # and disable the progress bar (-s)
    echo "fetching $url"
    curl -O -s "$url" &
done
wait # wait for all background jobs to terminate
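One limitation of the loop above is that a bare wait returns success regardless of how the individual jobs ended, so a failed download goes unnoticed. Here is a minimal sketch of one way to track failures, assuming a POSIX-style shell. The names fetch_all and FETCH are hypothetical and introduced only for illustration; they are not part of curl.

```shell
# FETCH is a hypothetical override hook; by default it runs curl.
# -f makes curl exit non-zero on HTTP errors, -sS hides the progress
# bar but still prints error messages.
FETCH=${FETCH:-curl -fsS -O}

# fetch_all URL...
# Starts one background job per URL, then waits for each job by PID
# so that every exit status is seen.
fetch_all() {
  pids=""
  failed=0
  for url in "$@"; do
    $FETCH "$url" &
    pids="$pids $!"
  done
  for pid in $pids; do
    # "wait PID" returns the exit status of that particular job
    wait "$pid" || failed=$((failed + 1))
  done
  echo "failed: $failed"
  [ "$failed" -eq 0 ]
}

# Usage (needs network access):
#   fetch_all "http://example.com/?page1.html" "http://example.com/?page2.html"
```

Like the original loop, this starts every download at once, so it is only suitable for a modest number of URLs.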
Solution 2
My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.
The one-liner I would use is, simply:
$ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s "$url"'
This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running all the time, each one handling a single argument, until all of the input arguments have been processed.
You can think of this as MapReduce in the shell. Or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork bomb your machine. It's possible to do something similar with a for loop in the shell, but you end up doing process management yourself, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.
Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:
parallel --jobs 2 curl -O -s 'http://example.com/?page{}.html' ::: {1..10}
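The update above suspects that the xargs example could be simplified. One possible variant, offered as a sketch: the -I option (available in both GNU and BSD xargs) substitutes each input line into the command directly, so the bash -c wrapper is not needed. The names fetch_pages and FETCH are hypothetical, and -P is a widespread extension rather than part of POSIX.

```shell
# FETCH is a hypothetical override hook; by default it runs curl.
FETCH=${FETCH:-curl -fsS -O}

# fetch_pages FIRST LAST JOBS
# Builds one URL per page number and runs at most JOBS fetches at a time.
fetch_pages() {
  # -I{} runs one command per input line, replacing {} with that line
  seq "$1" "$2" | xargs -P "$3" -I{} $FETCH "http://example.com/?page{}.html"
}

# Usage (needs network access):
#   fetch_pages 1 10 2
```

Because the jobs run concurrently, they may finish in any order.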
Solution 3
As of 7.66.0, the curl utility finally has built-in support for parallel downloads of multiple URLs within a single non-blocking process, which in most cases should be much faster and more resource-efficient than xargs and background spawning:
curl -Z 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'
This will download 18 links in parallel and write them out to 18 different files, also in parallel. The official announcement of this feature from Daniel Stenberg is here: https://daniel.haxx.se/blog/2019/07/22/curl-goez-parallel/
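Because -Z (--parallel) only exists from 7.66.0 onward, a script may want to check the installed version before relying on it. The following is a minimal sketch; curl_has_parallel is a hypothetical helper name, and the usage comment assumes the usual "curl X.Y.Z ..." first line of curl --version output.

```shell
# curl_has_parallel VERSION
# Succeeds if VERSION (for example "7.68.0") is 7.66.0 or newer.
curl_has_parallel() {
  major=${1%%.*}     # text before the first dot
  rest=${1#*.}       # text after the first dot
  minor=${rest%%.*}  # text between the first and second dot
  if [ "$major" -gt 7 ]; then
    return 0
  fi
  [ "$major" -eq 7 ] && [ "$minor" -ge 66 ]
}

# Usage (not run here, needs curl and network access):
#   version=$(curl --version | awk 'NR==1 { print $2 }')
#   if curl_has_parallel "$version"; then
#     curl -Z --parallel-max 5 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'
#   fi
```

The --parallel-max option caps the number of simultaneous transfers; without it curl allows up to 50.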
Solution 4
Curl can also accelerate a download of a file by splitting it into parts:
$ man curl |grep -A2 '\--range'
-r/--range <range>
(HTTP/FTP/SFTP/FILE) Retrieve a byte range (i.e a partial docu-
ment) from a HTTP/1.1, FTP or SFTP server or a local FILE.
Here is a script that will automatically launch curl with the desired number of concurrent processes: https://github.com/axelabs/splitcurl
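To illustrate what such a splitting script has to compute, here is a minimal sketch that divides a file of known size into byte ranges suitable for -r. byte_ranges is a hypothetical helper and is not taken from the linked script. The sketch assumes the server honours range requests and that the total size is already known (for example from the Content-Length header); $size and $url in the usage comment are placeholders.

```shell
# byte_ranges SIZE PARTS
# Prints one "start-end" range per line, covering bytes 0 .. SIZE-1.
byte_ranges() {
  size=$1
  parts=$2
  chunk=$(( (size + parts - 1) / parts ))  # ceiling division
  start=0
  while [ "$start" -lt "$size" ]; do
    end=$(( start + chunk - 1 ))
    if [ "$end" -ge "$size" ]; then
      end=$(( size - 1 ))                  # last part may be shorter
    fi
    echo "$start-$end"
    start=$(( end + 1 ))
  done
}

# Usage (not run here, needs network access):
#   i=0
#   for range in $(byte_ranges "$size" 4); do
#     curl -s -r "$range" -o "part.$i" "$url" &
#     i=$((i + 1))
#   done
#   wait
#   cat part.0 part.1 part.2 part.3 > whole.file
```

For example, byte_ranges 10 3 prints 0-3, 4-7 and 8-9 on separate lines.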
Solution 5
For launching parallel commands, why not use the venerable make command line utility? It supports parallel execution and dependency tracking and whatnot.
How? In the directory where you are downloading the files, create a new file called Makefile with the following contents:
# which page numbers to fetch
numbers := $(shell seq 1 10)
# default target which depends on files 1.html .. 10.html
# (patsubst replaces % with %.html for each number)
all: $(patsubst %,%.html,$(numbers))
# the rule which tells how to generate a %.html dependency
# $@ is the target filename e.g. 1.html
%.html:
	curl -C - 'http://www...../?page='$(patsubst %.html,%,$@) -o $@.tmp
	mv $@.tmp $@
NOTE The last two lines should start with a TAB character (instead of 8 spaces) or make will not accept the file.
Now you just run:
make -k -j 5
The curl command I used will store the output in 1.html.tmp, and only if the curl command succeeds will it be renamed to 1.html (by the mv command on the next line). Thus if some download should fail, you can just re-run the same make command and it will resume/retry downloading the files that failed the first time. Once all files have been successfully downloaded, make will report that there is nothing more to be done, so there is no harm in running it one extra time to be "safe".
(The -k switch tells make to keep downloading the rest of the files even if one single download should fail.)
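The same idea of downloading to a temporary name and renaming only on success can be sketched in plain shell for cases where make is not available. fetch_once and FETCH are hypothetical names; this is an illustration of the pattern under those assumptions, not a replacement for the Makefile.

```shell
# FETCH is a hypothetical override hook; by default it runs curl.
# "-C -" asks curl to resume from whatever is already in the output file.
FETCH=${FETCH:-curl -fsS -C -}

# fetch_once URL OUTFILE
# Skips files that are already complete; otherwise downloads to
# OUTFILE.tmp and renames it to OUTFILE only if the download succeeded.
fetch_once() {
  url=$1
  out=$2
  if [ -e "$out" ]; then
    return 0   # finished on an earlier run, nothing to do
  fi
  $FETCH "$url" -o "$out.tmp" && mv "$out.tmp" "$out"
}

# Usage (needs network access; the URL is a placeholder):
#   fetch_once "http://example.com/?page=1" 1.html
```

As with the Makefile, re-running the same call after a failure picks up the leftover .tmp file instead of starting over.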
Ravi Gupta
Updated on April 08, 2021
I want to download some pages from a website and I did it successfully using curl, but I was wondering if curl could somehow download multiple pages at a time, just like most download managers do; it would speed things up a little. Is it possible to do this with the curl command line utility?
The current command I am using is
curl 'http://www...../?page=[1-10]' 2>&1 > 1.html
Here I am downloading pages from 1 to 10 and storing them in a file named 1.html.
Also, is it possible for curl to write the output of each URL to a separate file, say URL.html, where URL is the actual URL of the page being processed?