cURL Multi Threading with PHP


Solution 1

This one always does the job for me... https://github.com/petewarden/ParallelCurl
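A minimal usage sketch, based on that project's README (the callback signature and method names come from there; the options and the request list are placeholders):

    require_once('parallelcurl.php');

    // invoked as each transfer completes
    function on_request_done($content, $url, $ch, $user_data) {
        $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        // ... parse $content, write results to your database, etc. ...
    }

    $max_requests = 10; // how many transfers to keep in flight at once
    $curl_options = array(CURLOPT_FOLLOWLOCATION => true);

    $parallel_curl = new ParallelCurl($max_requests, $curl_options);
    foreach ($competeRequests as $request) {
        $parallel_curl->startRequest($request, 'on_request_done', array());
    }
    $parallel_curl->finishAllRequests(); // block until outstanding transfers finish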

Solution 2

The accepted answer above is outdated, so the correct answer needs to be upvoted instead.

http://php.net/manual/en/function.curl-multi-init.php

Now, PHP supports fetching multiple URLs at the same time.
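For example, a minimal sketch using only the built-in curl_multi functions (the URLs are placeholders):

    $urls = array('http://example.com/a', 'http://example.com/b');

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // drive all transfers in parallel
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            // wait for activity instead of busy-looping
            // (note: buggy on some old PHP 5.3 builds, see the comments below)
            curl_multi_select($mh);
        }
    } while ($active && $status == CURLM_OK);

    foreach ($handles as $ch) {
        $body = curl_multi_getcontent($ch); // the response body
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);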

Solution 3

https://github.com/krakjoe/pthreads

[screenshot: the demo threading code and its output, 20,000 threads completing in ~18 seconds]

You may thread in PHP. The code depicted is just horrible thread programming, and I don't advise that you do it this way, but I wanted to show you the overhead of 20,000 threads: it's 18 seconds on my current hardware, an Intel G620 (dual core) with 8 GB of RAM; on server hardware you can expect much faster results. How you thread such a task depends on your resources, and on the resources of the service you are requesting. A sketch of the shape of it is below.
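A minimal sketch using the pthreads extension (requires a ZTS build of PHP with pthreads installed; the class name and URL list are hypothetical):

    // each worker fetches one URL in its own thread
    class Fetcher extends Thread {
        public $url;
        public $response;
        public function __construct($url) { $this->url = $url; }
        public function run() {
            $ch = curl_init($this->url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $this->response = curl_exec($ch);
            curl_close($ch);
        }
    }

    $threads = array();
    foreach ($urls as $url) {   // keep this list small: one thread per URL
        $t = new Fetcher($url);
        $t->start();
        $threads[] = $t;
    }
    foreach ($threads as $t) {
        $t->join();             // wait, then read $t->response
    }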

Solution 4

Put this at the top of your php script:

set_time_limit(0);                      // let the script run as long as it needs
@apache_setenv('no-gzip', 1);           // comment this out if you use nginx instead of apache
@ini_set('zlib.output_compression', 0); // turn off PHP-level gzip compression
@ini_set('implicit_flush', 1);          // flush output automatically after each write
for ($i = 0; $i < ob_get_level(); $i++) { ob_end_flush(); } // close any open output buffers
ob_implicit_flush(1);                   // and keep flushing from here on

That disables any output buffering and compression the web server or PHP may be doing, making your output show up in the browser while the script is still running.

Remember to comment out the apache_setenv line if you use the nginx web server instead of Apache.
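Once that is in place, a status line echoed after each batch will show up immediately; for example, a sketch adapted to the asker's chunked loop (the batch execution itself is elided):

    foreach (array_chunk($competeRequests, 1000) as $i => $requests) {
        // ... queue and execute this batch of 1,000 requests ...
        echo 'Finished batch ' . ($i + 1) . "<br>\n";
        flush(); // push the progress line out to the browser right away
    }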

Update for nginx:

So the OP is using nginx, which makes things a bit trickier, as nginx doesn't let you disable gzip compression from PHP. I also use nginx, and I just found out I have it active by default; see:

cat /etc/nginx/nginx.conf | grep gzip
    gzip on;
    gzip_disable "msie6";
    # gzip_vary on;
    # gzip_proxied any;
    # gzip_comp_level 6;
    # gzip_buffers 16 8k;
    # gzip_http_version 1.1;
    # gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript;

so you need to disable gzip in nginx.conf and restart nginx:

/etc/init.d/nginx restart

or you can play with the gzip_disable or gzip_types options, to conditionally disable gzip for some browsers or for some page content-types respectively.
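For example (the gzip directive is valid in the http, server and location contexts; the location path below is just an illustration):

    # globally, in the http block of /etc/nginx/nginx.conf
    gzip off;

    # or only for the long-running script
    location /rank-update.php {
        gzip off;
    }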

Author by user1647347

Updated on May 28, 2020

Comments

  • user1647347
    user1647347 almost 4 years

    I'm using cURL to get some rank data for over 20,000 domain names that I've got stored in a database.

    The code I'm using is http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading.

    The array $competeRequests is 20,000 requests to the compete.com API for website ranks.

    This is an example request: http://apps.compete.com/sites/stackoverflow.com/trended/rank/?apikey=xxxx&start_date=201207&end_date=201208&jsonp=

    Since there are 20,000 of these requests I want to break them up into chunks so I'm using the following code to accomplish that:

    foreach (array_chunk($competeRequests, 1000) as $requests) {
        foreach ($requests as $request) {
            $curl->addSession( $request, $opts );
        }
        // presumably followed here by the class's execute/clear step for the
        // batch, before moving on to the next chunk of 1,000
    }
    

    This works great for sending the requests in batches of 1,000; however, the script takes too long to execute. I've increased the max_execution_time to over 10 minutes.

    Is there a way to send 1,000 requests from my array, then parse the results, then output a status update, and then continue with the next 1,000 until the array is empty? As of now the screen just stays white the entire time the script is executing, which can be over 10 minutes.

    • hackattack
      hackattack over 11 years
      do you have to run this script through your server? Can't you just run it manually? 20,000 requests is a lot of requests; this will most likely have to run in the background
    • user1647347
      user1647347 over 11 years
      I did want to run it as a cron job eventually... for now I'm executing it in the browser...
    • aziz punjani
      aziz punjani over 11 years
    • user1647347
      user1647347 over 11 years
      I'm already using curl multi... "the code im using is semlabs.co.uk/journal/…. "
  • user1647347
    user1647347 over 11 years
    commented that line out... trying it now, but it seems that it's still not outputting any results until the script is done executing all the domains, instead of after each batch
  • user1647347
    user1647347 over 11 years
    yeah just confirmed that it doesn't output anything until the very end after the script has completed all 20,000
  • user1647347
    user1647347 over 11 years
    do you have any ideas for better, more efficient processing? Basically my database has 20,000 domains, and the script queries compete.com using their API to get each rank and then writes the ranks to the database. Ultimately I want to schedule the script to run automatically each month, so I was hoping to avoid JavaScript or any other client-side scripting.
  • Nelson
    Nelson over 11 years
    See my nginx update above to see if disabling gzip makes it work for your case.
  • zoltar
    zoltar about 10 years
    An important part of the ParallelCurl code is the bugfix for curl_multi_select (bugs.php.net/bug.php?id=63411) for PHP 5.3.18+ starting on line 118 in parallelcurl.php, which probably was the fix the OP was looking for (and anyone else with broken curl_multi scripts)
  • Mani
    Mani almost 8 years
    This answer is outdated and the correct answer is php.net/manual/en/function.curl-multi-init.php
  • specializt
    specializt almost 8 years
    the "programming" itself is fine - its the complete lack of thread scheduling / thread execution queue that makes this script inefficient. In general, a good scheduler should never allow more than insert amount of CPU cores here to run simultaneously, except for machines with HT and threads which have extreme wait times and it should be able to queue at least 2^32 threads
  • Glenn Plas
    Glenn Plas almost 8 years
    The correct answer is the one I gave at the time. There is no need to correct old questions on SO just to keep up with science and development. At the time, that was the answer. You don't know if someone uses old PHP versions or not, it's possible they are. Don't assume everyone runs the latest. If you claim a certain support exists in PHP, you should also mention what version YOU are talking about.
  • NVRM
    NVRM about 6 years
    To anyone reading: you need to make AJAX calls from JS to resolve that. Disabling the caching is a bad idea: if someone opens a few thousand HTTP requests, the server will crash, as it is going to swap to the limit.