Curl PHP http_code says 404 but browser says it's 200


Solution 1

I found the answer in a comment here: http://w-shadow.com/blog/2007/08/02/how-to-check-if-page-exists-with-curl/comment-page-1/#comment-12186. Setting CURLOPT_NOBODY to true makes cURL send a HEAD request, which some servers don't like (Forbes, for example); they may respond with "Empty reply from server". To fix this, also set CURLOPT_HTTPGET to switch the request back to GET.

/* don’t download the page, just the header (much faster in this case) */
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_HTTPGET, true); //this is needed to fix the issue
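
A slightly fuller sketch of the same idea: probe with HEAD first and repeat the request as a plain GET only when the HEAD result looks untrustworthy. The function names (status_probe_options, head_result_is_suspect, request_status, probe_status), the 15-second timeout, and the list of "suspect" codes are assumptions made for illustration; only the curl_* functions and CURLOPT_*/CURLINFO_* constants come from PHP's cURL extension. Note that the GET fallback transfers the response body, so it is slower than a HEAD-only check.

```php
<?php
/** Build cURL options for a status probe: HEAD when $useHead is true, plain GET otherwise. */
function status_probe_options(bool $useHead): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true, // return the response instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects before reporting a status
        CURLOPT_TIMEOUT        => 15,   // assumed value, tune as needed
    ];
    if ($useHead) {
        $options[CURLOPT_NOBODY] = true;  // makes cURL send HEAD
    } else {
        $options[CURLOPT_HTTPGET] = true; // explicit GET
    }
    return $options;
}

/** Assumption: 0 (no HTTP response), 404, 405 and 501 from a HEAD request deserve a GET retry. */
function head_result_is_suspect(int $httpCode): bool
{
    return in_array($httpCode, [0, 404, 405, 501], true);
}

/** Perform one request and return the HTTP status code (0 if no response was received). */
function request_status(string $url, bool $useHead): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, status_probe_options($useHead));
    curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

/** Try HEAD first; fall back to GET when the HEAD answer is suspect. */
function probe_status(string $url): int
{
    $code = request_status($url, true);
    if (head_result_is_suspect($code)) {
        $code = request_status($url, false);
    }
    return $code;
}
```

Calling probe_status("http://www.breakingnews.com") would then report whatever status the GET request returns whenever the server answers the HEAD request with one of the suspect codes.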

Solution 2

I'm not sure what your code looks like, but this works fine:

$url = "http://www.breakingnews.com";
$ch = curl_init ( $url );
curl_setopt ( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0" );
curl_setopt ( $ch, CURLOPT_AUTOREFERER, true );
curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );

curl_exec ( $ch );
var_dump ( curl_getinfo ( $ch ) );
// Report a transport-level error, if any, then always release the handle.
if (curl_errno ( $ch )) {
    print curl_error ( $ch );
}
curl_close ( $ch );

Output

array
  'url' => string 'http://www.breakingnews.com' (length=27)
  'content_type' => string 'text/html; charset=utf-8' (length=24)
  'http_code' => int 200
  'header_size' => int 330
  'request_size' => int 154
  'filetime' => int -1
  'ssl_verify_result' => int 0
  'redirect_count' => int 0
  'total_time' => float 4.243
  'namelookup_time' => float 0.171
  'connect_time' => float 0.374
  'pretransfer_time' => float 0.374
  'size_upload' => float 0
  'size_download' => float 68638
  'speed_download' => float 16176
  'speed_upload' => float 0
  'download_content_length' => float -1
  'upload_content_length' => float 0
  'starttransfer_time' => float 3.681
  'redirect_time' => float 0
  'certinfo' => 
    array
      empty
  'redirect_url' => string '' (length=0)
Author by

Farzher


Updated on June 08, 2022

Comments

  • Farzher
    Farzher almost 2 years

    I found out why this was happening, check my answer

    This is the only domain this happens on. I'm running curl_multi on a bunch of URLs, and this one comes back with a 404 http_code: http://www.breakingnews.com

    But when I visit it in the browser it's 200 OK (it takes a while to load), and it doesn't even look like a redirect.

    Does anyone know what's going on? Is this a common problem?

    here's a var_dump:

    ["info"]=> array(22) {
      ["url"]=> string(27) "http://www.breakingnews.com"
      ["content_type"]=> string(24) "text/html; charset=utf-8"
      ["http_code"]=> int(404)
      ["header_size"]=> int(337)
      ["request_size"]=> int(128)
      ["filetime"]=> int(-1)
      ["ssl_verify_result"]=> int(0)
      ["redirect_count"]=> int(0)
      ["total_time"]=> float(1.152229)
      ["namelookup_time"]=> float(0.001261)
      ["connect_time"]=> float(0.020121)
      ["pretransfer_time"]=> float(0.020179)
      ["size_upload"]=> float(0)
      ["size_download"]=> float(9755)
      ["speed_download"]=> float(8466)
      ["speed_upload"]=> float(0)
      ["download_content_length"]=> float(-1)
      ["upload_content_length"]=> float(0)
      ["starttransfer_time"]=> float(1.133522)
      ["redirect_time"]=> float(0)
      ["certinfo"]=> array(0) { }
      ["redirect_url"]=> string(0) ""
    }
    ["error"]=> string(0) ""
    

    UPDATE: This actually looks like a PHP bug with curl_setopt($ch, CURLOPT_NOBODY, true); see https://bugs.php.net/bug.php?id=39611

    EDIT: It's not a bug.

    • Andrea
      Andrea about 12 years
      Check you didn't make a spelling error.
    • Marc B
      Marc B about 12 years
      Some sites have anti-scraping defences and may return a 404 if they detect a scraper (e.g. curl's user agent). Try your code again and have curl fake a Firefox (or other real browser's) user agent.
    • Brad
      Brad about 12 years
      I just did some tests on http://www.breakingnews.com, and they don't check user-agent. Or at least... they don't care when the user-agent isn't set.
    • Farzher
      Farzher about 12 years
      useragent was a good idea, it didn't work though. Maybe someone else could CURL it and see if they get 404 too? (:
    • John
      John about 12 years
      I did what @Brad did.. using Jmeter. Got a 200 OK back.
    • John
      John about 12 years
      Interesting: I just hit it from the command line with a HEAD request and got a 404: curl -i -X HEAD http://www.breakingnews.com/ EDIT: But the following works fine: curl http://www.breakingnews.com
  • Farzher
    Farzher about 12 years
    Why does it only 404 when I do this? curl_setopt($curl_arr[$i], CURLOPT_NOBODY, true);
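
    Since the question involves curl_multi, here is a minimal sketch of collecting status codes for several URLs with every handle forced to GET, so that no HEAD request is sent. The function name multi_status_codes, the 15-second timeout, and the 1-second select timeout are assumptions for illustration, not the asker's actual code.

```php
<?php
/**
 * Fetch the HTTP status code for each URL in parallel using GET requests.
 * Returns an array mapping URL => status code (0 when no response arrived).
 */
function multi_status_codes(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true, // do not print response bodies
            CURLOPT_HTTPGET        => true, // plain GET instead of HEAD
            CURLOPT_TIMEOUT        => 15,   // assumed value
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $codes = [];
    foreach ($handles as $url => $ch) {
        $codes[$url] = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $codes;
}
```

    Because the result is keyed by URL, duplicate URLs in the input collapse into a single entry.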