PHP / Curl: HEAD Request takes a long time on some sites


Solution 1

Try simplifying it a little bit:

print htmlentities(file_get_contents("http://www.arstechnica.com"));

The above outputs instantly on my web server. If it doesn't on yours, there's a good chance your web host has some kind of setting in place to throttle these kinds of requests.
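
If file_get_contents() also hangs for you, it can help to rule out an indefinite wait by passing a stream context with an explicit timeout. This is just a sketch; the 5-second value is an arbitrary choice:

// Same one-liner, but with a short read timeout so a slow or throttled
// host fails fast instead of hanging the whole script.
$context = stream_context_create(array(
    'http' => array(
        'timeout' => 5, // seconds to wait before giving up
    ),
));

print htmlentities(file_get_contents("http://www.arstechnica.com", false, $context));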

EDIT:

Since the above happens instantly for you, try setting this curl setting on your original code:

curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);

Using the tool you posted, I noticed that http://www.arstechnica.com sends a 301 redirect for any request made to it. It is possible that cURL is receiving this but not following the new Location specified, which could be causing your script to hang.

SECOND EDIT:

Curiously enough, running the same code you have above made my web server hang too. I replaced this line:

curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'

With this:

curl_setopt($ch, CURLOPT_NOBODY, true);

This is the way the manual recommends you make a HEAD request, and it made the request return instantly.
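
Putting the two changes together, a minimal sketch of the whole request might look like this (the URL is just the example from the question; everything else mirrors the options you already had):

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.arstechnica.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the output instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_HEADER, true);           // include the response headers in the output
curl_setopt($ch, CURLOPT_NOBODY, true);           // send HEAD and don't wait for a body
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow the 301 to the new Location

$headers = curl_exec($ch);
curl_close($ch);

print htmlentities($headers);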

Solution 2

You have to remember that HEAD is only a suggestion to the web server. For HEAD to do the right thing, it often takes some explicit effort on the part of the admins. If you HEAD a static file, Apache (or whatever your web server is) will often step in and do the right thing. If you HEAD a dynamic page, the default for most setups is to execute the GET path, collect all the results, and just send back the headers without the content. If that application is in a 3 (or more) tier setup, that call could be very expensive and needless for a HEAD context. For instance, in a Java servlet, doHead() by default just calls doGet(). To do something smarter for the application, the developer would have to explicitly implement doHead() (and more often than not, they will not).
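
The same caveat applies to a PHP application: unless the script checks the request method itself, a HEAD request runs exactly the same expensive code path as a GET. As a rough sketch of what "implementing HEAD" can look like (the cache file name and the report function are hypothetical):

// Hypothetical endpoint: answer HEAD from cheap cached metadata instead
// of regenerating the full report that a GET would produce.
if ($_SERVER['REQUEST_METHOD'] === 'HEAD') {
    header('Content-Type: text/csv');
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', filemtime('report-cache.csv')) . ' GMT');
    exit; // a HEAD response carries no body, so stop here
}

// GET (and anything else) still does the expensive work.
echo generate_full_report(); // hypothetical, slow function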

I encountered an app from a Fortune 100 company that is used for downloading several hundred megabytes of pricing information. We'd check for updates to that data by executing HEAD requests fairly regularly until the modified date changed. It turns out that this request would actually make back-end calls to generate the list every time we made it, which involved gigabytes of data on their back end and transferring it between several internal servers. They weren't terribly happy with us, but once we explained the use case they quickly came up with an alternate solution. If they had implemented HEAD, rather than relying on their web server to fake it, it would not have been an issue.

Solution 3

If my memory doesn't fail me, doing a HEAD request in cURL changes the HTTP protocol version to 1.0 (which is slow and probably the culprit here), so try changing that to:

$ch = curl_init();
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt ($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

// Only calling the head
curl_setopt($ch, CURLOPT_HEADER, true); // header will be at output
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1); // ADD THIS

$content = curl_exec ($ch);
curl_close ($ch);

Solution 4

I used the function below to find the redirected URL.

$head = get_headers($url, 1);

The second argument makes it return an array with named keys. For example, the following will give the Location value.

$head["Location"]

http://php.net/manual/en/function.get-headers.php
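
Note that get_headers() sends a GET request by default. If you only want the headers, you can tell it to send HEAD instead via the default stream context. A minimal sketch:

// Make get_headers() issue a HEAD request so the response body is
// never transferred at all.
stream_context_set_default(array(
    'http' => array(
        'method' => 'HEAD',
    ),
));

$head = get_headers($url, 1);
print_r($head["Location"]); // a string, or an array if there were several redirects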


Comments

  • Ian
    Ian almost 2 years

    I have simple code that does a head request for a URL and then prints the response headers. I've noticed that on some sites, this can take a long time to complete.

    For example, requesting http://www.arstechnica.com takes about two minutes. I've tried the same request using another web site that does the same basic task, and it comes back immediately. So there must be something I have set incorrectly that's causing this delay.

    Here's the code I have:

    $ch = curl_init();
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt ($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    
    // Only calling the head
    curl_setopt($ch, CURLOPT_HEADER, true); // header will be at output
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'
    
    $content = curl_exec ($ch);
    curl_close ($ch);
    

    Here's a link to a web site that performs the same function: http://www.seoconsultants.com/tools/headers.asp

    The code above, at least on my server, takes two minutes to retrieve www.arstechnica.com, but the service at the link above returns it right away.

    What am I missing?

    • Jasen
      Jasen almost 10 years
      What curl is missing is a response body: it doesn't know that HEAD requests only return headers (no body), so it keeps waiting for the server to send more data. That's why curl waits for two minutes and then gives up.
  • Ian
    Ian about 15 years
    It appears to be using HTTP 1.1 by default, at least according to the response that I do eventually get: HTTP/1.1 301 Moved Permanently. In any case, adding that line has no effect.
  • Ian
    Ian about 15 years
    And I am only trying to retrieve the first page's response headers, not anything further down the line.
  • neofutur
    neofutur almost 12 years
    I had the same problem with the same code on a Debian server (not on the Gentoo server), and the fix (CURLOPT_NOBODY instead of CURLOPT_CUSTOMREQUEST, 'HEAD', plus CURLOPT_FOLLOWLOCATION) worked instantly! Many thanks for this answer, you saved my ass ;) The commit: github.com/neofutur/gwgd/commit/…
  • Synexis
    Synexis almost 8 years
    Note that while this function uses a GET request by default, you can set it to use a HEAD request (to reduce overhead by not retrieving an entire page, if applicable) with PHP's stream_context functions. An example is provided in the manual entry.
  • Vaibs
    Vaibs over 6 years
    file_get_contents or any HTTP request will print in no time if the URL actually exists or is alive. If the URL is dead/does not exist, it will take time to output NULL. In that case a timeout parameter needs to be set.
  • quickshiftin
    quickshiftin about 5 years
    Let's hope that, for their own benefit, high-traffic sites recognize this and implement HEAD as nature intended. Great use case though, thanks for sharing.