file_get_contents() Breaks Up UTF-8 Characters

php utf-8 file-get-contents

Solution 1

Alright. I have found out that file_get_contents() is not causing this problem. There is a different cause, which I describe in another question. Silly me.

See this question: Why Does DOM Change Encoding?

Solution 2

I had a similar problem with Polish text.

I tried:

$fileEndEnd = mb_convert_encoding($fileEndEnd, 'UTF-8', mb_detect_encoding($fileEndEnd, 'UTF-8', true));

I tried:

$fileEndEnd = utf8_encode ( $fileEndEnd );

I tried:

$fileEndEnd = iconv( "UTF-8", "UTF-8", $fileEndEnd );

And then:

$fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8");

This last one worked perfectly!
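The 'HTML-ENTITIES' target used above is deprecated as of PHP 8.2, so on newer versions the same idea (turning every non-ASCII character into an ASCII-only entity) may need a different call. A minimal sketch using mb_encode_numericentity() follows. The sample string and the $convmap range are assumptions for illustration, and the output uses numeric entities rather than named ones.

```php
<?php
// Sketch: encode every non-ASCII character as a decimal numeric entity.
// $convmap = [first code point, last code point, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Sample Polish input (an assumption for this example); file must be saved as UTF-8.
$fileEndEnd = 'zażółć';

// The result is pure ASCII, so a parser that guesses the wrong charset
// can no longer misread the multi-byte sequences.
$encoded = mb_encode_numericentity($fileEndEnd, $convmap, 'UTF-8');

echo $encoded, PHP_EOL;
```

Because the result contains only ASCII bytes, it survives a consumer that assumes ISO-8859-1, which is probably why this approach worked where the plain charset conversions did not.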

Solution 3

// Read a file and convert it to UTF-8, guessing between UTF-8 and ISO-8859-1.
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

You might also try your luck with http://php.net/manual/en/function.mb-internal-encoding.php
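A variation on the function above, sketched under the assumption that the input is either already valid UTF-8 or in one known fallback encoding: validate first with mb_check_encoding() and convert only when validation fails. The helper name ensure_utf8 and the ISO-8859-1 default are hypothetical, not part of the original answer.

```php
<?php
// Hypothetical helper: convert only when the bytes are not valid UTF-8.
// $fallback is an assumed source encoding; adjust it to the actual data.
function ensure_utf8(string $content, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($content, 'UTF-8', $fallback);
}

// "š" as UTF-8 bytes passes through unchanged.
$kept = ensure_utf8("\xC5\xA1");

// "é" as a single ISO-8859-1 byte gets converted to two UTF-8 bytes.
$converted = ensure_utf8("\xE9");
```

Checking validity before converting avoids re-encoding text that is already correct, which is one way a double conversion can creep in.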

Solution 4

I think you simply have a double conversion of the character type there :D

It may be because you loaded one HTML document inside another HTML document, so you end up with something that looks like this:

<!DOCTYPE html> 
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title></title>
</head>
<body>
<!DOCTYPE html> 
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>.......

Using mb_detect_encoding here may therefore lead you to other issues.
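To make the double-conversion idea concrete, here is a small sketch of what happens when bytes that are already UTF-8 are treated as ISO-8859-1 and converted again. The sample character is an assumption; the byte values are printed in hex so the result does not depend on the terminal's encoding.

```php
<?php
// "š" (U+0161) as UTF-8 is the two bytes C5 A1.
$original = "\xC5\xA1";

// Mistake: treat those UTF-8 bytes as ISO-8859-1 and convert to UTF-8 again.
// Each byte becomes its own character, giving "Å" followed by "¡".
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), ' -> ', bin2hex($double), PHP_EOL;

// Reversing the mistaken step restores the original bytes.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
```

The "Å" in the garbled output matches the kind of character reported in the question, which is consistent with (though not proof of) a double conversion somewhere in the pipeline.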

Solution 5

With Turkish text, neither mb_convert_encoding nor any other charset conversion worked.

urlencode did not work either, because it converts the space character to +. For percent-encoding it must be %20.

This one worked!

   $url = rawurlencode($url);
   $url = str_replace("%3A", ":", $url);
   $url = str_replace("%2F", "/", $url);

   $data = file_get_contents($url);
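The difference between the two encoders can be seen in a short sketch. The sample URL and file name are assumptions for illustration.

```php
<?php
// Hypothetical URL whose path contains a space.
$url = 'http://example.com/my file.html';

// urlencode() follows form encoding and turns a space into "+".
$plus = urlencode('my file.html');

// rawurlencode() follows RFC 3986 and turns a space into "%20".
$percent = rawurlencode('my file.html');

// Encode the whole URL, then restore the scheme and path separators.
$safe = str_replace(['%3A', '%2F'], [':', '/'], rawurlencode($url));

echo $plus, PHP_EOL, $percent, PHP_EOL, $safe, PHP_EOL;
```

Note that this approach also encodes characters such as ?, & and =, so a URL with a query string would probably need those restored as well, or its parts encoded separately.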

Author by

Richard Knop

Updated on July 05, 2022

Comments

Richard Knop almost 2 years

I am loading HTML from an external server. The HTML markup is UTF-8 encoded and contains characters such as ľ, š, č, ť, ž etc. When I load the HTML with file_get_contents() like this:

$html = file_get_contents('http://example.com/foreign.html');

It messes up the UTF-8 characters and loads Å, ¾, ¤ and similar nonsense instead of proper UTF-8 characters.

How can I solve this?

UPDATE:

I tried both saving the HTML to a file and outputting it with UTF-8 encoding. Neither works, which suggests file_get_contents() is already returning broken HTML.
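One way to test that suspicion is to look at the charset the server declares in its response headers. After a file_get_contents() call over HTTP, PHP fills the local variable $http_response_header with the raw header lines. The sketch below parses such a list; the helper name charset_from_headers and the sample headers are hypothetical, and it runs on a hard-coded array so it needs no network access.

```php
<?php
// Hypothetical helper: pull the charset out of a list of raw header lines,
// such as the ones PHP places in $http_response_header after an HTTP request.
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $header) {
        // Matches e.g. "Content-Type: text/html; charset=UTF-8" (quotes optional).
        if (preg_match('/^Content-Type:.*charset="?([\w-]+)/i', $header, $m)) {
            return strtoupper($m[1]);
        }
    }
    return null; // no charset declared in the headers
}

// Sample headers (an assumption for this example).
$sample = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=windows-1250',
];

$charset = charset_from_headers($sample);
```

If the declared charset differs from the one in the page's meta tag, the bytes themselves may be fine and only their interpretation is off.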

UPDATE2:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sk" lang="sk">
<head>

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="Content-Language" content="sk" />
<title>Test</title>

</head>
<body>


<?php

$html = file_get_contents('http://example.com');
echo htmlentities($html);

?>

</body>
</html>

helpse almost 10 years

This should be marked as best answer. Thanks Gordon.
artur99 over 8 years

It gives me ã instead of ă and ª instead of Ș :((
JustRandom about 8 years

For all Germans: use ISO-8859-1 instead of UTF-8. This will fix äöüß for you. Great fix, thanks.
Reado almost 7 years

file_get_contents() is causing the problem. I had a JSON file I was opening with file_get_contents() but upon doing a print_r() after loading the JSON, the unicode characters were there but not in the JSON. Performing the mb_convert_encoding() on the file_get_contents() fixed the problem.
Reado almost 7 years

THIS should be the answer! For me, file_get_contents() was converting £ to the unicode version. Using mb_convert_encoding() after using file_get_contents() resolved the issue. Thank you!
Hatem Badawi over 6 years

after 5 hours of working that answer saved my day .... great man thanks
WEBjuju about 6 years

$string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8"); solved it for me.
dynamichael over 5 years

Easy, simple, perfect.
Peter Artoung over 5 years

Perfect : $fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8");
ashleedawg almost 5 years

This answer led me to a solution (to correctly display French-Canadian characters from a .csv from StatsCanada) which was to use mb_convert_encoding like: $myString=mb_convert_encoding($myString, 'UTF-8', "ISO-8859-1"); . . . with hints from this answer and this list of Supported Character Encodings. Thanks!
Robby Cornelissen over 4 years

How is this different from this answer?