file_get_contents() Breaks Up UTF-8 Characters
Solution 1
Alright. I have found out that file_get_contents() is not causing this problem. There is a different cause, which I discuss in another question. Silly me.
See this question: Why Does DOM Change Encoding?
Solution 2
I had a similar problem with the Polish language.
I tried:
$fileEndEnd = mb_convert_encoding($fileEndEnd, 'UTF-8', mb_detect_encoding($fileEndEnd, 'UTF-8', true));
I tried:
$fileEndEnd = utf8_encode ( $fileEndEnd );
I tried:
$fileEndEnd = iconv( "UTF-8", "UTF-8", $fileEndEnd );
And then -
$fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8");
This last one worked perfectly!
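A side note on the last call: as far as I know, passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated since PHP 8.2. A minimal sketch of one alternative, assuming numeric entities are acceptable in place of named ones, uses mb_encode_numericentity(). The helper name to_numeric_entities is hypothetical, not something from the answer above.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8
// string with a decimal numeric entity such as &#322;.
function to_numeric_entities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask].
    // This map covers everything from U+0080 upward and leaves ASCII alone.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo to_numeric_entities("\u{0142}"), "\n"; // Polish l with stroke, prints &#322;
```

Unlike the 'HTML-ENTITIES' conversion, this always emits numeric entities, so ó becomes &#243; rather than &oacute;. Browsers render both the same way.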
Solution 3
Solution suggested in the comments of the PHP manual entry for file_get_contents
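function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    // Detect the source encoding in strict mode (candidates: UTF-8, then ISO-8859-1)
    // and convert from it to UTF-8.
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}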
You might also try your luck with http://php.net/manual/en/function.mb-internal-encoding.php
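If detection feels too magical, here is a rough sketch of a more explicit variant: keep the bytes as they are when they are already valid UTF-8, and convert from a single assumed fallback encoding otherwise. The helper name to_utf8 and the ISO-8859-1 default are assumptions for illustration. Pick the fallback that matches your actual source.

```php
<?php
// Hypothetical helper: return $bytes unchanged when they are valid UTF-8,
// otherwise treat them as $fallback (an assumption) and convert to UTF-8.
function to_utf8(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, converting again would corrupt it
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// "\xE4" is the single ISO-8859-1 byte for the German a-umlaut.
var_dump(to_utf8("\xE4") === "\u{00E4}"); // bool(true)
```

The validity check is what protects against converting text that is already UTF-8 a second time.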
Solution 4
I think you simply have a double conversion of the character encoding there :D
This may be because you opened an HTML document within an HTML document, so you end up with something that looks like this:
<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title></title>
</head>
<body>
<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>.......
The use of mb_detect_encoding may therefore lead you to other issues.
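To make the double conversion idea concrete, here is a small sketch of what happens when text that is already UTF-8 gets treated as ISO-8859-1 and converted to UTF-8 a second time. The variable names are mine, and the ISO-8859-1 step is an assumption about where the extra conversion comes from.

```php
<?php
// š is U+0161, stored in UTF-8 as the two bytes C5 A1.
$original = "\u{0161}";

// Wrongly treating those two bytes as ISO-8859-1 and converting to UTF-8
// turns each byte into its own character: Å (U+00C5) and ¡ (U+00A1).
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Converting back from UTF-8 to ISO-8859-1 restores the original bytes.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

var_dump($doubleEncoded === "\u{00C5}\u{00A1}"); // bool(true)
var_dump($repaired === $original);               // bool(true)
```

The Å in the result looks a lot like the garbage described in the question, which is why a second conversion somewhere in the pipeline is worth ruling out.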
Solution 5
In Turkish, mb_convert_encoding and other charset conversions did not work.
urlencode did not work either, because it converts the space character to +. It must be %20 for percent-encoding.
This one worked!
$url = rawurlencode($url);
$url = str_replace("%3A", ":", $url);
$url = str_replace("%2F", "/", $url);
$data = file_get_contents($url);
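The same steps wrapped in a function, as a sketch. The name encode_url_for_fetch is hypothetical. One caveat: rawurlencode() also encodes ?, = and &, so this only suits URLs without a query string.

```php
<?php
// Hypothetical helper: percent-encode a URL with rawurlencode()
// (space becomes %20, not +), then restore ':' and '/' so the
// scheme and path separators keep working.
function encode_url_for_fetch(string $url): string
{
    $encoded = rawurlencode($url);
    return str_replace(['%3A', '%2F'], [':', '/'], $encoded);
}

echo urlencode('a b'), "\n";    // prints a+b
echo rawurlencode('a b'), "\n"; // prints a%20b
echo encode_url_for_fetch("http://example.com/a b/\u{015F}.html"), "\n";
// prints http://example.com/a%20b/%C5%9F.html
```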
Richard Knop
Updated on July 05, 2022
Comments
-
Richard Knop almost 2 years
I am loading HTML from an external server. The markup is UTF-8 encoded and contains characters such as ľ, š, č, ť, ž, etc. When I load the HTML with file_get_contents() like this:
$html = file_get_contents('http://example.com/foreign.html');
It messes up the UTF-8 characters and loads Å, ¾, ¤ and similar nonsense instead of proper UTF-8 characters.
How can I solve this?
UPDATE:
I tried both saving the HTML to a file and outputting it with UTF-8 encoding. Neither works, so file_get_contents() must already be returning broken HTML.
UPDATE2:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sk" lang="sk">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="Content-Language" content="sk" />
<title>Test</title>
</head>
<body>
<?php
$html = file_get_contents('http://example.com');
echo htmlentities($html);
?>
</body>
</html>
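One thing that may be worth checking in a test page like this: before PHP 5.4, htmlentities() assumed ISO-8859-1 unless told otherwise, which on its own can turn valid UTF-8 into Å-style garbage. This is only a guess at a contributing factor, not a confirmed diagnosis of the page above. A small sketch of the difference, with variable names of my own:

```php
<?php
// š is U+0161, stored in UTF-8 as the two bytes C5 A1.
$html = "\u{0161}<b>";

// Treating the UTF-8 bytes as ISO-8859-1 escapes each byte separately.
$wrong = htmlentities($html, ENT_QUOTES, 'ISO-8859-1');

// Naming the charset explicitly keeps the character intact;
// htmlspecialchars() only escapes the HTML special characters.
$right = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // prints &Aring;&iexcl;&lt;b&gt;
echo $right, "\n"; // prints š&lt;b&gt;
```

Passing the charset explicitly removes the dependency on the PHP version's default.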
-
helpse almost 10 years
This should be marked as best answer. Thanks Gordon.
-
artur99 over 8 years
It gives me ã instead of ă and ª instead of Ș :((
-
JustRandom about 8 years
For German text, use ISO-8859-1 instead of UTF-8. This will fix äöüß for you. Great fix, thanks.
-
Reado almost 7 years
file_get_contents() is causing the problem. I had a JSON file I was opening with file_get_contents(), but upon doing a print_r() after loading the JSON, the Unicode characters were there but not in the JSON. Running mb_convert_encoding() on the result of file_get_contents() fixed the problem.
-
Reado almost 7 years
THIS should be the answer! For me, file_get_contents() was converting £ to the Unicode version. Using mb_convert_encoding() after file_get_contents() resolved the issue. Thank you!
-
Hatem Badawi over 6 years
After 5 hours of work, that answer saved my day. Great, thanks!
-
WEBjuju about 6 years
$string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");
solved it for me.
-
dynamichael over 5 years
Easy, simple, perfect.
-
Peter Artoung over 5 years
Perfect:
$fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8");
-
ashleedawg almost 5 years
This answer led me to a solution (to correctly display French-Canadian characters from a .csv from StatsCanada), which was to use mb_convert_encoding like:
$myString = mb_convert_encoding($myString, 'UTF-8', "ISO-8859-1");
. . . with hints from this answer and this list of Supported Character Encodings. Thanks!
-
Robby Cornelissen over 4 years
How is this different from this answer?