Running thousands of curl background processes in parallel in bash script

linux performance bash curl wget

62,875

Solution 1

Following the question strict:

mycurl() {
    START=$(date +%s)
    curl -s "http://some_url_here/"$1  > $1.txt
    END=$(date +%s)
    DIFF=$(( $END - $START ))
    echo "It took $DIFF seconds"
}
export -f mycurl

seq 100000 | parallel -j0 mycurl

Shorter if you do not need the boilerplate text around the timings:

seq 100000 | parallel -j0 --joblog log curl -s http://some_url_here/{} ">" {}.txt
cut -f 4 log

If you want to run 1000s in parallel you will hit some limits (such as file handles). Raising ulimit -n or /etc/security/limits.conf may help.

Solution 2

for i in {1..100000}

There are only 65536 ports. Throttle this.

for n in {1..100000..1000}; do   # start 100 fetch loops
        for i in `eval echo {$n..$((n+999))}`; do
                echo "club $i..."
                curl -s "http://some_url_here/"$i  > $i.txt
        done &
        wait
done

(edit: ~~echo~~curl
(edit: strip severely dated assertion about OS limits and add the missing wait)

62,875

zavg

Updated on September 18, 2022

Comments

zavg over 1 year
I am running thounsand of curl background processes in parallel in the following bash script
```
START=$(date +%s)
for i in {1..100000}
do       
    curl -s "http://some_url_here/"$i  > $i.txt&
    END=$(date +%s)
    DIFF=$(( $END - $START ))
    echo "It took $DIFF seconds"
done
```
I have 49Gb Corei7-920 dedicated server (not virtual).

I track memory consumption and CPU through top command and they are far away from bounds.

I am using ps aux | grep curl | wc -l to count the number of current curl processes. This number increases rapidly up to 2-4 thousands and then starts to continuously decrease.

If I add simple parsing through piping curl to awk (curl | awk > output) than curl processes number raise up just to 1-2 thousands and then decreases to 20-30...

Why number of processes decrease so dramatically? Where are the bounds of this architecture?
- Admin over 10 years
  
  You're probably hitting the one of the limits of max running processes or max open sockets. ulimit will show some of those limits.
- Admin over 10 years
  
  I also would suggest using parallel(1) for such tasks: manpages.debian.org/cgi-bin/…
- Admin over 10 years
  
  Try start=$SECONDS and end=$SECONDS - and use lower case or mixed case variable names by habit in order to avoid potential name collision with shell variables. However, you're really only getting the ever-increasing time interval of the starting of each process. You're not getting how long the download took since the process is in the background (and start is only calculated once). In Bash, you can do (( diff = end - start )) dropping the dollar signs and allowing the spacing to be more flexible. Use pgrep if you have it.
- Admin over 10 years
  
  I agree with HBruijn. Notice how your process count is halved when you double the number of processes (by adding awk).
- Admin over 10 years
  
  @zhenech @HBrujin I launched parallel and it says me that I may run just 500 parallel tasks due to system limit of file handles. I raised limit in limits.conf, but now when I try to run 5000 simulaneus jobs it instantly eats all my memory (49 Gb) even before start because every parallel perl script eats 32Mb.
- Admin about 9 years
  
  Original question didn't specify how long a single request takes... what if the earlier instances have completed?
- Admin over 8 years
  
  Why don't you use ab (apache benchmark)? You can set any concurrency.
phemmer over 10 years

Actually the OS can handle this just fine. This is a limitation of TCP. No OS, no matter how special, will be able to get around it. But OP's 4k connections is nowhere near 64k (or the 32k default of some distros)
Guy Avraham over 6 years

And if I wish to run several commands as the one in the short answer version in parallel, how do I do that ?
Ole Tange over 6 years

Quote it: seq 100 | parallel 'echo here is command 1: {}; echo here is command 2: {}'. Spend an hour walking through the tutorial. Your command line will love you for it: man parallel_tutorial