file_get_contents script works with some websites but not others

Solution 1

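// Requires the PHP Simple HTML DOM Parser library (simple_html_dom.php).
$html = file_get_html('http://google.com/');
// find() returns an array unless an index is given; 0 selects the first <title>.
$title = $html->find('title', 0)->innertext;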

Or, if you prefer preg_match, you should really be using cURL instead of file_get_contents:

function curl($url){

    // Browser-like request headers; some sites reject requests that lack them.
    $headers    = array();
    $headers[]  = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
    $headers[]  = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $headers[]  = "Accept-Language:en-us,en;q=0.5";
    $headers[]  = "Accept-Encoding:gzip,deflate";
    $headers[]  = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $headers[]  = "Keep-Alive:115";
    $headers[]  = "Connection:keep-alive";
    $headers[]  = "Cache-Control:max-age=0";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_ENCODING, "gzip");   // transparently decompress the response
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);  // follow HTTP redirects
    $data = curl_exec($curl);                       // false on failure
    curl_close($curl);
    return $data;

}


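$data = curl('http://www.google.com');
$regex = '#<title>(.*?)</title>#mis';
// Only read $match[1] when the pattern matched; curl() returns false on failure.
if ($data !== false && preg_match($regex, $data, $match)) {
    var_dump($match);
    echo $match[1];
}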
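
If you want the title extraction to be reusable, it could be wrapped in a small function. The following is only a sketch: extract_title() is a hypothetical helper name, not part of PHP or of the answer above, and it assumes the page is UTF-8. It tolerates attributes on the title tag, decodes HTML entities, and returns null instead of raising a notice when no title is found.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the page title, or null when no <title> element is found.
function extract_title(string $html): ?string
{
    // [^>]* tolerates attributes on the tag; "i" ignores case,
    // "s" lets "." match newlines inside the title.
    if (!preg_match('#<title[^>]*>(.*?)</title>#is', $html, $match)) {
        return null;
    }
    // Decode entities such as &amp; (assumes UTF-8 input),
    // then collapse runs of whitespace into single spaces.
    $title = html_entity_decode($match[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $title));
}
```

Pass it the string returned by curl() after checking that the request did not return false. A regular expression is still a rough tool for HTML, so for anything beyond a single tag a real parser is the safer choice.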

Solution 2

The site just requires a User-Agent header; any string will do ("any" in this example):

file_get_contents("http://www.freshdirect.com",false,stream_context_create(
    array("http" => array("user_agent" => "any"))
));

See the HTTP context options page in the PHP manual for more options.

Of course, you can also set user_agent globally, with ini_set() or in php.ini:

 ini_set("user_agent","any");
 echo file_get_contents("http://www.freshdirect.com");

... but I prefer to be explicit for the next programmer working on it.
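
When a site still refuses the request, it helps to see the HTTP status code rather than just the "failed to open stream" warning. The sketch below is one possible way to do that with the same stream-context approach. The function names (build_http_options, parse_status_code, fetch_url) and the 10-second timeout are assumptions made for illustration; the context options user_agent, timeout and ignore_errors are the documented HTTP context options.

```php
<?php
// Hypothetical helper: builds the "http" context options for a request.
function build_http_options(string $userAgent, float $timeout = 10.0): array
{
    return [
        'http' => [
            'user_agent'    => $userAgent, // sent as the User-Agent header
            'timeout'       => $timeout,   // read timeout in seconds (assumed value)
            'ignore_errors' => true,       // return the body even on 4xx/5xx responses
        ],
    ];
}

// Hypothetical helper: extracts the status code from a status line such as
// "HTTP/1.1 403 Forbidden"; returns null for any other header line.
function parse_status_code(string $statusLine): ?int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) ? (int) $m[1] : null;
}

// Hypothetical helper: fetches $url and returns [statusCode, body].
// body is false if the request failed outright (e.g. DNS or connection error).
function fetch_url(string $url, string $userAgent = 'any'): array
{
    $context = stream_context_create(build_http_options($userAgent));
    $body = @file_get_contents($url, false, $context);

    // The HTTP wrapper fills $http_response_header in the calling scope.
    // After redirects it holds several status lines; keep the last one.
    $status = null;
    foreach ($http_response_header ?? [] as $line) {
        $status = parse_status_code($line) ?? $status;
    }
    return [$status, $body];
}
```

A call such as list($status, $body) = fetch_url("http://www.freshdirect.com"); then lets you inspect $status to see whether the server answered with an error code or the request never got a response at all.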

Author by

Admin

Updated on June 26, 2022

Comments

  • Admin
    Admin about 2 years

    I'm looking to build a PHP script that parses HTML for particular tags. I've been using this code block, adapted from this tutorial:

    <?php 
    $data = file_get_contents('http://www.google.com');
    $regex = '/<title>(.+?)</';
    preg_match($regex,$data,$match);
    var_dump($match); 
    echo $match[1];
    ?>
    

    The script works with some websites (like google, above), but when I try it with other websites (like, say, freshdirect), I get this error:

    "Warning: file_get_contents(http://www.freshdirect.com) [function.file-get-contents]: failed to open stream: HTTP request failed!"

    I've seen a bunch of great suggestions on StackOverflow, for example to enable extension=php_openssl.dll in php.ini. But (1) my version of php.ini didn't have extension=php_openssl.dll in it, and (2) when I added it to the extensions section and restarted the WAMP server, per this thread, still no success.

    Would someone mind pointing me in the right direction? Thank you very much!