file_get_contents() converts UTF-8 to ISO-8859-1

Solution 1

This looks like a content negotiation problem: file_get_contents probably sends a request that the server treats as accepting only ISO-8859-1 as the character encoding.

You can create a custom stream context for file_get_contents using stream_context_create that explicitly states that you accept UTF-8:

$opts = array('http' => array('header' => 'Accept-Charset: UTF-8, *;q=0'));
$context = stream_context_create($opts);

$filename = "http://search.yahoo.com/search;_ylt=A0oG7lpgGp9NTSYAiQBXNyoA?p=naj%C5%A1%C5%A5astnej%C5%A1%C3%AD&fr2=sb-top&fr=yfp-t-701&type_param=&rd=pref";
echo file_get_contents($filename, false, $context);
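To check whether the server honored the Accept-Charset header, one option is to look at the charset it declares in its response. The sketch below is only an illustration: charsetFromHeaders() is a hypothetical helper, not a PHP built-in, and the sample header lines are made up. After an HTTP file_get_contents() call, PHP exposes the real header lines in $http_response_header, which could be passed in instead.

```php
<?php
// Hypothetical helper (not part of PHP): return the charset declared in a
// Content-Type response header, or null if no charset is declared.
function charsetFromHeaders(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        // Matches e.g. "Content-Type: text/html; charset=ISO-8859-1"
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([^";\s]+)/i', $line, $m)) {
            // Keep the last match: when redirects are followed,
            // the headers of the final response come last.
            $charset = $m[1];
        }
    }
    return $charset;
}

// Made-up sample in the shape of $http_response_header.
$sample = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=ISO-8859-1',
];
echo charsetFromHeaders($sample), "\n";
```

If this reports ISO-8859-1 even with the custom context, the server ignored the preference and the body was already sent in that encoding.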

Solution 2

file_get_contents should not change the charset. The data is pulled in as a binary string.

When I request the URL you provided, this is the header the server returns:

Content-Type: text/html; charset=ISO-8859-1

Also, in the body:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

Also, you can't losslessly convert UTF-8 to ISO-8859-1 and get the characters back when converting to UTF-8 again. UTF-8 / Unicode supports many more characters than ISO-8859-1, so the characters that don't fit are lost in the first step.

In the browser this is not the case, so perhaps you just need to send a suitable Accept-Charset header to tell Yahoo's system that you can accept UTF-8.
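The lossy round trip can be reproduced locally. This is a minimal sketch, assuming the mbstring extension is loaded and the script file itself is saved as UTF-8; the sample word is the search term from the question's URL.

```php
<?php
// Requires the mbstring extension; this file must be saved as UTF-8.
$original = 'najšťastnejší';

// Step 1: UTF-8 -> ISO-8859-1. "š" and "ť" do not exist in ISO-8859-1,
// so mbstring replaces them with its substitute character.
$latin1 = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8');

// Step 2: ISO-8859-1 -> UTF-8. The replaced characters stay replaced.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $roundTrip, "\n";
var_dump($roundTrip === $original); // bool(false)
```

With mbstring's default settings the substitute character is a question mark, which would match the symptom in the question. Only "í" survives, because ISO-8859-1 does contain it.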

Solution 3

For anyone investigating this:

The time I spent on encoding issues taught me that PHP functions rarely "magically" change the encoding of strings. One of the rare exceptions is:

exec( $command, $output, $returnVal )

Note also that, in my case, the header that worked was:

header('Content-Type: text/html; charset=utf-8');

and not:

header('Content-Type: text/html; charset=UTF-8');

I had an issue similar to the one you describe, and setting the headers properly was enough to fix it.

Hope this helps!

Solution 4

$s2 = iconv("ISO-8859-1", "UTF-8//TRANSLIT//IGNORE", file_get_contents($filename));

Better solution...

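function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow "document has moved" redirects
    curl_setopt($ch, CURLOPT_ENCODING, '');         // empty string: accept every compression curl supports
    $result = curl_exec($ch);                       // false on failure
    curl_close($ch);                                // close the handle before returning
    return $result;
}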

echo curl($filename);
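Whichever function fetches the page, the body can be converted once the charset the server declared is known. A minimal sketch, assuming mbstring is available; toUtf8() is a hypothetical helper and the sample bytes are made up for illustration.

```php
<?php
// Hypothetical helper: return $body as UTF-8, given the charset the server declared.
function toUtf8(string $body, ?string $declaredCharset): string
{
    // Nothing declared, or already UTF-8: leave the bytes untouched.
    if ($declaredCharset === null || strcasecmp($declaredCharset, 'UTF-8') === 0) {
        return $body;
    }
    // Argument order is (string, to_encoding, from_encoding); requires mbstring.
    return mb_convert_encoding($body, 'UTF-8', $declaredCharset);
}

$latin1Body = "M\xFCnchen";                    // "München" encoded as ISO-8859-1
echo toUtf8($latin1Body, 'ISO-8859-1'), "\n";  // prints "München" as UTF-8
```

This only helps for characters the server actually sent. If the response already contains question marks in place of š or ť, no conversion can bring them back.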
Author by

vladinko0

Updated on June 19, 2020

Comments

  • vladinko0
    vladinko0 almost 4 years

    I am trying to get search results from yahoo.com.

    But file_get_contents() converts the UTF-8 content (UTF-8 is the charset that Yahoo uses) to ISO-8859-1.

    Try:

    $filename = "http://search.yahoo.com/search;_ylt=A0oG7lpgGp9NTSYAiQBXNyoA?p=naj%C5%A1%C5%A5astnej%C5%A1%C3%AD&fr2=sb-top&fr=yfp-t-701&type_param=&rd=pref";
    
    echo file_get_contents($filename);
    

    Approaches such as

    header('Content-Type: text/html; charset=UTF-8');
    

    or

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    

    or

    $er = mb_convert_encoding($filename , 'UTF-8');
    

    or

    $s2 = iconv("ISO-8859-1","UTF-8",$filename );
    

    or

    echo utf8_encode(file_get_contents($filename));
    

    do NOT help, because after fetching the web content, special characters such as š, ť, ž are replaced with question marks (???).

    I would appreciate any kind of help.

  • vladinko0
    vladinko0 about 13 years
    Result is: The document has moved here.
  • vladinko0
    vladinko0 about 13 years
    How did you find out Content-Type: text/html; charset=ISO-8859-1 and <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"> When I look in source code of that page I see <!doctype html><html lang="en"><head><meta http-equiv="content-type" content="text/html; charset=UTF-8">
  • Dejan Marjanović
    Dejan Marjanović about 13 years
    @vladinko0, I think you need to set CURLOPT_FOLLOWLOCATION, I've updated my answer, try again.
  • vladinko0
    vladinko0 about 13 years
    Now it loads the page, but with the same result as with file_get_contents(), that is, with question marks. The charset is also converted to ISO-8859-1.
  • Dejan Marjanović
    Dejan Marjanović about 13 years
    It seems that yahoo.com is serving different pages (charsets) depending on your IP (country). I changed your URL to http://ru.search.yahoo.com but it doesn't work. Maybe you can achieve something with Accept-Charset headers, for example by refusing ISO-8859-1...
  • Dejan Marjanović
    Dejan Marjanović about 13 years
    It serves different encoding based on your location, you could try fetching page using Russian proxy servers.
  • Dejan Marjanović
    Dejan Marjanović about 13 years
    Funny thing, I tried Accept-Charset=utf-8;q=0.7,*;q=0.7, but doesn't work :)
  • Gumbo
    Gumbo about 13 years
    @webarto: The value utf-8;q=0.7,*;q=0.7 is like utf-8,* and would accept any character encoding equally.
  • Craig Morgan
    Craig Morgan about 10 years
    Nice one Gumbo! I was struggling with umlauts in the url (München) - this solved the problem. Thanks!