file_get_contents() Breaks Up UTF-8 Characters

Solution 1

Alright. I have found out that file_get_contents() is not causing this problem. There's a different reason, which I describe in another question. Silly me.

See this question: Why Does DOM Change Encoding?

Solution 2

I had a similar problem with the Polish language.

I tried:

$fileEndEnd = mb_convert_encoding($fileEndEnd, 'UTF-8', mb_detect_encoding($fileEndEnd, 'UTF-8', true));

I tried:

$fileEndEnd = utf8_encode($fileEndEnd);

I tried:

$fileEndEnd = iconv("UTF-8", "UTF-8", $fileEndEnd);

And then:

$fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8");

This last one worked perfectly!
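
The HTML-ENTITIES target rewrites every non-ASCII character as an HTML entity, so the string becomes plain ASCII and survives a later step that misreads its encoding (for example an HTML parser, as Solution 1 hints). Note that PHP 8.2 deprecates the HTML-ENTITIES pseudo-encoding in mbstring. Below is a minimal sketch of an alternative based on mb_encode_numericentity(); the helper name utf8_to_numeric_entities is hypothetical and not part of the original answer, and it assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer): turn every non-ASCII
// character of a UTF-8 string into a numeric HTML entity such as &#318;.
function utf8_to_numeric_entities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// "\xC4\xBE" is the UTF-8 byte sequence for ľ (U+013E).
echo utf8_to_numeric_entities("<p>\xC4\xBE</p>"), "\n";
```

ASCII characters, including the markup itself, pass through unchanged, so the result can be handed to code that expects HTML.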

Solution 3

This solution is suggested in the comments of the PHP manual entry for file_get_contents:

function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    // Detect the source encoding (strict mode), then convert to UTF-8.
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

You might also try your luck with http://php.net/manual/en/function.mb-internal-encoding.php
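
Detecting among several candidate encodings is a heuristic. A more conservative variant is sketched below: it leaves the content alone when it is already valid UTF-8 and converts only otherwise. The function name to_utf8 and the ISO-8859-1 fallback are assumptions made for illustration, not something the manual comment prescribes.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not
// already valid UTF-8. The fallback encoding is an assumption.
function to_utf8(string $content, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($content, 'UTF-8', $fallback);
}

// "\xE4" is ä in ISO-8859-1 and is not valid UTF-8 on its own.
echo bin2hex(to_utf8("\xE4")), "\n";
```

Skipping the conversion for valid UTF-8 input avoids encoding the same text twice, which is the failure mode described in Solution 4.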

Solution 4

I think you simply have a double conversion of the character encoding there :D

It may be because you opened an HTML document within an HTML document, so you end up with something that looks like this:

<!DOCTYPE html> 
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title></title>
</head>
<body>
<!DOCTYPE html> 
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>.......

The use of mb_detect_encoding therefore may lead you to other issues.
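
To illustrate what a double conversion does to the bytes, here is a small sketch. It assumes the extra step treats UTF-8 bytes as ISO-8859-1, which is one common cause and would be consistent with the Å and ¾ characters reported in the question; the variable names are illustrative only.

```php
<?php
// "\xC5\xA1" is the UTF-8 byte sequence for š (U+0161).
$utf8 = "\xC5\xA1";

// Wrongly treating those bytes as ISO-8859-1 and converting "to UTF-8"
// encodes each byte separately: 0xC5 becomes Å and 0xA1 becomes ¡.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n";

// If exactly one extra conversion happened, the reverse conversion undoes it.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8);
```

Comparing bin2hex() output before and after each processing step is a quick way to find where the extra conversion happens.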

Solution 5

In Turkish, mb_convert_encoding and the other charset conversions did not work.

urlencode did not work either, because it converts the space character to +. Percent-encoding requires %20 instead.

This one worked!

   $url = rawurlencode($url);
   $url = str_replace("%3A", ":", $url);
   $url = str_replace("%2F", "/", $url);

   $data = file_get_contents($url);
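
The three lines above can be wrapped in a small function. This is only a sketch: the name encode_url_keep_separators is hypothetical, and it assumes the URL has no query string, since ?, & and = would also be percent-encoded by rawurlencode().

```php
<?php
// Hypothetical wrapper around the approach above: percent-encode the whole
// URL, then restore the scheme and path separators.
function encode_url_keep_separators(string $url): string
{
    $encoded = rawurlencode($url);
    return str_replace(['%3A', '%2F'], [':', '/'], $encoded);
}

// urlencode() turns a space into "+", rawurlencode() turns it into "%20".
echo urlencode('a b'), ' ', rawurlencode('a b'), "\n";

// "\xC3\xA7" is the UTF-8 byte sequence for ç; the URL itself is made up.
echo encode_url_keep_separators("http://example.com/\xC3\xA7ay bah\xC3\xA7esi.html"), "\n";
```

The result can then be passed to file_get_contents() as in the original snippet.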
Author by

Richard Knop

Updated on July 05, 2022

Comments

  • Richard Knop almost 2 years

    I am loading HTML from an external server. The HTML markup is UTF-8 encoded and contains characters such as ľ, š, č, ť, ž etc. When I load the HTML with file_get_contents() like this:

    $html = file_get_contents('http://example.com/foreign.html');
    

    It messes up the UTF-8 characters and loads Å, ¾, ¤ and similar nonsense instead of proper UTF-8 characters.

    How can I solve this?

    UPDATE:

    I tried both saving the HTML to a file and outputting it with UTF-8 encoding. Neither works, which means file_get_contents() is already returning broken HTML.

    UPDATE2:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sk" lang="sk">
    <head>
    
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta http-equiv="Content-Style-Type" content="text/css" />
    <meta http-equiv="Content-Language" content="sk" />
    <title>Test</title>
    
    </head>
    <body>
    
    
    <?php
    
    $html = file_get_contents('http://example.com');
    echo htmlentities($html);
    
    ?>
    
    </body>
    </html>
    
  • helpse almost 10 years
    This should be marked as best answer. Thanks Gordon.
  • artur99 over 8 years
    It gives me ã instead of ă and ª instead of Ș :((
  • JustRandom about 8 years
    For all Germans: use ISO-8859-1 instead of UTF-8. This will fix äöüß for you. Great fix, thanks.
  • Reado almost 7 years
    file_get_contents() is causing the problem. I had a JSON file I was opening with file_get_contents(), but when I did a print_r() after loading the JSON, the Unicode characters were there but not in the JSON. Performing mb_convert_encoding() on the result of file_get_contents() fixed the problem.
  • Reado almost 7 years
    THIS should be the answer! For me, file_get_contents() was converting £ to the Unicode version. Using mb_convert_encoding() after file_get_contents() resolved the issue. Thank you!
  • Hatem Badawi over 6 years
    After 5 hours of work, that answer saved my day. Great man, thanks.
  • WEBjuju about 6 years
    $string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8"); solved it for me.
  • dynamichael over 5 years
    Easy, simple, perfect.
  • Peter Artoung over 5 years
    Perfect: $fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8");
  • ashleedawg almost 5 years
    This answer led me to a solution (to correctly display French-Canadian characters from a .csv from StatsCanada), which was to use mb_convert_encoding like: $myString = mb_convert_encoding($myString, 'UTF-8', "ISO-8859-1"); with hints from this answer and this list of Supported Character Encodings. Thanks!
  • Robby Cornelissen over 4 years
    How is this different from this answer?