PHP Parallel curl requests

34,987

Solution 1

If you mean multi-curl, then something like this might help:


$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

for ($i = 0; $i < $node_count; $i++)
{
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// run the transfers; this loop polls curl_multi_exec() continuously until $running drops to 0
do {
    curl_multi_exec($master, $running);
} while ($running > 0);

// collect the response bodies in the same order as $nodes
$results = array();
for ($i = 0; $i < $node_count; $i++)
{
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);

Hope it helps in some way.

Solution 2

I don't particularly like the approach of any of the existing answers.

Timo's code: might sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong. It might also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which may make the code spin at 100% CPU usage (of 1 core) for no reason.

Sudhir's code: will not sleep when $still_running > 0, and spam-calls the async function curl_multi_exec() until everything has been downloaded, which causes PHP to use 100% CPU (of 1 core) for the whole duration. In other words, it fails to sleep while downloading.

Here's an approach with neither of those issues:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
    $worker = curl_init($website);
    curl_setopt_array($worker, [
        CURLOPT_RETURNTRANSFER => 1
    ]);
    curl_multi_add_handle($mh, $worker);
}
for (;;) {
    $still_running = null;
    do {
        $err = curl_multi_exec($mh, $still_running);
    } while ($err === CURLM_CALL_MULTI_PERFORM);
    if ($err !== CURLM_OK) {
        // handle curl multi error?
    }
    if ($still_running < 1) {
        // all downloads completed
        break;
    }
    // some haven't finished downloading, sleep until more data arrives:
    curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
    if ($info["result"] !== CURLE_OK) {
        // handle download error?
    }
    $results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
    curl_multi_remove_handle($mh, $info["handle"]);
    curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);

Note that an issue shared by all 3 approaches here (my answer, Sudhir's answer, and Timo's answer) is that they open all connections simultaneously: if you have 1,000,000 websites to fetch, these scripts will try to open 1,000,000 connections at once. If you only want to download, say, 50 websites at a time, maybe try:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
var_dump(fetch_urls($websites,50));
function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (! is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); // ?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    // curl handles are resources before PHP 8.0 and CurlHandle objects from 8.0 on; casting
    // an object to int does not give a unique number, so derive the array key per type
    $handle_id = function ($handle): int {
        return is_object($handle) ? spl_object_id($handle) : (int) $handle;
    };
    $work = function () use (&$ret, &$workers, &$mh, $return_fault_reason, $handle_id) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($still_running < count($workers)) {
                // some workers finished, fetch their response and close them
                break;
            }
            // all workers are still busy: sleep until there is activity (at most 1 second)
            curl_multi_select($mh, 1);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            $id = $handle_id($info['handle']);
            assert(isset($workers[$id]));
            $url = $workers[$id];
            if ($info['result'] !== CURLE_OK) {
                if ($return_fault_reason) {
                    $ret[$url] = print_r(array(
                        false,
                        $info['result'],
                        "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
                    ), true);
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$url] = print_r(array(
                        false,
                        $err,
                        "curl error " . $err . ": " . curl_strerror($err)
                    ), true);
                }
            } else {
                $ret[$url] = curl_multi_getcontent($info['handle']);
            }
            curl_multi_remove_handle($mh, $info['handle']);
            unset($workers[$id]);
            curl_close($info['handle']);
        }
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            // the pool is full: wait until at least one worker finishes
            $work();
        }
        $neww = curl_init($url);
        if (! $neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(
                    false,
                    - 1,
                    "curl_init() failed"
                );
            }
            continue;
        }
        $workers[$handle_id($neww)] = $url;
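        curl_setopt_array($neww, array(
            CURLOPT_RETURNTRANSFER => 1,
            // NOTE: the next two options turn off TLS certificate verification; remove them to keep curl's default checks
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));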
        curl_multi_add_handle($mh, $neww);
        // curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        // echo "WAITING FOR WORKERS TO BECOME 0!";
        // var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

That will download the entire list without ever running more than 50 downloads at once. Even that approach stores all the results in RAM, though, so it may still run out of memory; if you want to store the results in a database instead, the curl_multi_getcontent part can be modified to write each response to the database rather than to a RAM-resident variable.
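As a rough sketch of that last idea, the snippet below writes each body to a file instead of a database, only because that needs no extra setup. save_body(), the ".body" suffix and the "fetch_urls_demo" directory are hypothetical names invented for this illustration; they are not part of the answer's code, and a real database write would replace the file_put_contents() call.

```php
<?php
// Hypothetical helper (not part of the original answer): persist one response
// body to disk and return its path, so only a short string stays in memory.
function save_body(string $dir, string $url, string $body): string
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("cannot create directory: " . $dir);
    }
    // sha1() of the URL gives a filesystem-safe, fixed-length file name
    $path = $dir . DIRECTORY_SEPARATOR . sha1($url) . ".body";
    if (file_put_contents($path, $body) === false) {
        throw new RuntimeException("cannot write: " . $path);
    }
    return $path;
}

// Inside fetch_urls(), the success branch would then store
//     save_body($dir, <url of this handle>, curl_multi_getcontent($info['handle']))
// in $ret instead of the body itself ($dir being an extra, assumed parameter).

// Standalone demonstration with a made-up response body (no network needed):
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . "fetch_urls_demo";
$path = save_body($dir, "http://example.org", "<html>hello</html>");
echo $path, "\n";
```

With this change $ret holds one short path per URL, so memory use no longer grows with the size of the downloaded pages, only with the number of URLs.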

Author by

user1205408

Updated on February 04, 2021

Comments

  • user1205408
    user1205408 over 3 years

    I am doing a simple app that reads JSON data from 15 different URLs. I have a special requirement that this be done server-side. I am using file_get_contents($url).

    Since I am using file_get_contents($url), I wrote a simple script. Here it is:

    $websites = array(
        $url1,
        $url2,
        $url3,
        // ...
        $url15
    );

    $data = array();
    foreach ($websites as $website) {
        $data[] = file_get_contents($website);
    }

    It proved to be very slow, because it waits for each request to finish before starting the next one.

  • user1205408
    user1205408 over 12 years
    Trying it now... :). I will let you know if it will work, Thank you so much.
  • Theodore R. Smith
    Theodore R. Smith about 12 years
    Oh, this happens to me all the time! Or they vote up the answer and don't accept it, or accept it but don't vote it up. Frustrating.
  • ramya br
    ramya br over 8 years
    may i know what $running contains?
  • Shlizer
    Shlizer about 8 years
    @ramyabr boolean (reference) if multicurl is still running and getting data.
  • hanshenrik
    hanshenrik almost 4 years
    your multi_exec() loop makes no sense and will always exit on the first iteration... if you absolutely insist on supporting CURLM_CALL_MULTI_PERFORM (which has been deprecated in curl since at least 2012 and is not used anymore), the loop should be like: for (;;) { do { $ex = curl_multi_exec($mh, $still_running); } while ($ex === CURLM_CALL_MULTI_PERFORM); if ($ex !== CURLM_OK) { /*handle curl error?*/ } if ($still_running < 1) { break; } curl_multi_select($mh, 1); }
  • hanshenrik
    hanshenrik almost 4 years
    your multi_exec loop will work, but it will also waste a ton of cpu, using 100% CPU (of 1 core) until everything has been downloaded, because your loop is spamming curl_multi_exec(), an async function, as fast as possible, until everything is downloaded. if you change it to do {curl_multi_exec($master,$running);if($running>0){curl_multi_select($master,1);}} while($running > 0); then it will use ~1% cpu instead of 100% cpu (a better loop can still be constructed though, this would be even better: for(;;){curl_multi_exec($master,$running);if($running<1)break;curl_multi_select($master,1);} )
  • hanshenrik
    hanshenrik almost 4 years
    your code is handling CURLM_CALL_MULTI_PERFORM (hence CCMP) wrong, you're not supposed to run select() if you get CCMP, you're supposed to call multi_exec() again if you get CCMP, but worse, as of (2012ish?) curl never returns CCMP anymore, so your $state === CCMP check will always fail, meaning your exec loop will always exit after the first iteration
  • hanshenrik
    hanshenrik almost 4 years
    @DivyeshPrajapati it works great until you check how much CPU it's consuming, see my comment above ^^
  • hanshenrik
    hanshenrik almost 4 years
    @Shlizer that's incorrect, $running contains an int, the number of curl handles that still haven't finished downloading the entire response (it's safe to use the variable as if it was a bool, though, because int(0)==false and int(>=1)==true, but the variable itself is int, not bool, and it can contain any number >= 0, like int(5) )
  • Timo Huovinen
    Timo Huovinen over 3 years
    My original reasoning was to add it as backwards compatibility for older versions of curl (pre 2012) and it's ok if it just exits the loop immediately. That's also why I packaged it into curl_multi_exec_full, which can be renamed to curl_multi_exec for post 2012 compatibility. CCMP will select and exec again. I really do appreciate your comment and would like some more reasoning why the code is wrong, right now I'm not seeing the error.
  • hanshenrik
    hanshenrik over 3 years
    for one: you run select() if you get CCMP, that's wrong. you're not supposed to wait for more data to arrive if you get CCMP. it means you're immediately supposed to run curl_multi_exec() again if you get CCMP (it allows programs that need very low latency/realtime-systems to do other stuff if a single multi_exec() used too much cpu/time, but so many people didn't understand how to use it correctly that the curl devs decided to deprecate it: too many got it wrong, and very few people actually needed it. on the curl mailing list there was only 1 person that complained and actually used it)
  • hanshenrik
    hanshenrik over 3 years
    two: you never run select() if you don't get CCMP, but that's also wrong, sometimes (in these days, OFTEN) you're supposed to run select() even if you don't get CCMP, but your code doesn't.
  • hanshenrik
    hanshenrik over 3 years
    here is how i think the function should look like: 3v4l.org/1iaqm
  • Timo Huovinen
    Timo Huovinen over 3 years
    @hanshenrik When I read the documentation (I don't remember where it is) it said that select didn't do anything besides adding wait time while CCMP, which was actually required for Windows, otherwise it would hit the 100% cpu mark on old Curls, so if I remove the select I would be breaking it for pre 2012 curl on windows. I do run select, it's inside the curl_multi_wait function, notice that it counts process completion one process at a time lower down the code, meaning that we don't care that curl_multi_exec_full just finished in one loop or runs select, which it won't on new curl
  • Timo Huovinen
    Timo Huovinen over 3 years
    @hanshenrik do { $state = curl_multi_exec($mh, $still_running); } while ($state === CURLM_CALL_MULTI_PERFORM); hits 100% cpu and I'm pretty sure is a bug (especially on windows), the select is basically a timeout to prevent 100% cpu. Remember that do will hit 100% cpu unless there's a sleep in there.
  • Timo Huovinen
    Timo Huovinen over 3 years
    @hanshenrik also notice that I'm capturing the response as it completes (unlike your example), not when all of them have been completed, allowing me to do manual redirects with minimal time loss. I can even inject additional requests after multi has been started.
  • hanshenrik
    hanshenrik over 3 years
    that loop hits 100% cpu when there's more data downloaded and ready to be fetched, as it should do. funny thing about this script: 3v4l.org/eaHCl if you run it on MS Windows's cmd.exe (on a fast 50mbit connection, at least) it will actually use 100% cpu, from CMD.exe - cmd is very slow at receiving null bytes, and it's being bombarded with 50mbit worth of null bytes every second. but if you run it on a cygwin terminal, or a linux terminal, or if you run it as php foo.php > NUL (NUL is windows's /dev/null, but in Windows it's in every folder), it uses ~1% CPU, try it yourself :P
  • Timo Huovinen
    Timo Huovinen over 3 years
    @hanshenrik Interesting, I did not know that. To clarify, I added the curl_multi_select into CCMP to cope with an old windows bug, where the select acts as a kind of sleep. I'm a bit worried that removing it will make it less "robust", but I'm ok with that.
  • Timo Huovinen
    Timo Huovinen over 3 years
    @hanshenrik what's the harm in keeping curl_multi_select for CCMP in curl_multi_exec_full?
  • hanshenrik
    hanshenrik over 3 years
    CCMP means "more data is ready to be read now, you should run read() now, it will not block" - and then your code proceeds to.. run select() (instead of read()) and wait for even more data to arrive, instead of read()ing - if the next data comes slowly, or if some buffers are full and waiting to be read, i'm assuming it can slow down the code (waiting on select() when you should be read()'ing )
  • Divyesh Prajapati
    Divyesh Prajapati over 3 years
    @hanshenrik didn't check that, but it definitely reduces request time... I had 10 requests running simultaneously, each taking 3 seconds, so in total it was taking around 25-30 seconds, but after using this the time dropped to 5-8 seconds
  • Ali Niaz
    Ali Niaz over 3 years
    Could you please tell me what $return_fault_reason amounts to?
  • hanshenrik
    hanshenrik over 3 years
    @AliNiaz sorry forgot about that when copying the code from this answer, $return_fault_reason is supposed to be an argument telling if a failed download should just be ignored, or if a failed download should come with an error message; i updated the code with the $return_fault_reason argument now.
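
On the question raised in the comments about what $running contains, a quick check along these lines should show that curl_multi_exec() writes an integer count into its second argument. This is only a sketch and assumes the curl extension is loaded; no handles are added to the multi handle, so nothing is fetched and no network access is needed.

```php
<?php
// Assumes ext-curl is available. The multi handle is left empty on purpose,
// so curl_multi_exec() has nothing to transfer.
$mh = curl_multi_init();
$running = null;
$status = curl_multi_exec($mh, $running);

// $status is a CURLM_* code; $running is the number of transfers still active
var_dump($status === CURLM_OK);
var_dump($running);

curl_multi_close($mh);
```

Because the value is a count, loop conditions such as $running > 0 and $running < 1 are the natural way to test it.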