UTF-8 to Unicode Code Points
Solution 1
Converting one character set to another can be done with iconv:
http://php.net/manual/en/function.iconv.php
Note that UTF-8 is already a Unicode encoding.
Another way is simply using htmlentities with the right character set:
http://php.net/manual/en/function.htmlentities.php
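As a rough sketch of the iconv route: converting the string to UTF-32BE gives exactly four bytes per character, which can then be unpacked into integer code points. The helper name utf8ToCodePoints() is made up for this illustration, and it assumes your iconv build knows the "UTF-32BE" encoding name (glibc and GNU libiconv do).

```php
<?php
// Hypothetical helper: returns the Unicode code points of a UTF-8 string.
function utf8ToCodePoints(string $utf8): array
{
    // UTF-32BE stores each code point as one big-endian 32-bit integer, without a BOM.
    $utf32 = iconv('UTF-8', 'UTF-32BE', $utf8);
    if ($utf32 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    // 'N*' reads unsigned 32-bit big-endian integers; unpack() keys start at 1,
    // so array_values() re-indexes the result from 0.
    return array_values(unpack('N*', $utf32));
}

$points = utf8ToCodePoints('tchüß');
echo implode(' ', array_map(function (int $cp): string {
    return sprintf('U+%04X', $cp);
}, $points)), "\n";
// U+0074 U+0063 U+0068 U+00FC U+00DF
```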
Solution 2
For a readable form I would go with JSON. JSON does not require non-ASCII characters to be escaped, but PHP's json_encode() escapes them by default:
echo json_encode("tchüß");
# Outputs: "tch\u00fc\u00df"
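To see that the escaping is a default rather than a requirement, here is a small comparison using the JSON_UNESCAPED_UNICODE flag. The variable names are only illustrative.

```php
<?php
$word = "tchüß";

// Default behaviour: non-ASCII characters become \uXXXX escapes.
$escaped = json_encode($word);
// With JSON_UNESCAPED_UNICODE the characters are emitted as literal UTF-8.
$raw = json_encode($word, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n"; // "tch\u00fc\u00df"
echo $raw, "\n";     // "tchüß"

// Both forms decode back to the same string.
var_dump(json_decode($escaped) === json_decode($raw)); // bool(true)

// Strip the surrounding quotes if only the escaped text is wanted.
echo trim($escaped, '"'), "\n"; // tch\u00fc\u00df
```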
Solution 3
With PHP 7, there is a new IntlChar::ord() to find the Unicode Code Point from a given UTF-8 character:
var_dump(sprintf('U+%04X', IntlChar::ord('ß')));
# Outputs: string(6) "U+00DF"
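Building on that, a minimal sketch for a whole string: leave ASCII characters alone and replace everything else with an escape built from IntlChar::ord(). The function name escapeNonAscii() and the \u{...} output format are assumptions chosen for this example, and the intl extension must be loaded.

```php
<?php
// Hypothetical helper: escapes every non-ASCII character as \u{HEX}.
function escapeNonAscii(string $utf8): string
{
    // With the /u modifier the character class matches one whole
    // code point outside the ASCII range, not a single byte.
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $match): string {
            return sprintf('\\u{%X}', IntlChar::ord($match[0]));
        },
        $utf8
    );
}

echo escapeNonAscii('tchüß'), "\n"; // tch\u{FC}\u{DF}
```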
Solution 4
I guess you're going to print out your strings on a website?
I'm storing all my databases in UTF-8, using htmlentities($string) before output.
If your strings are not UTF-8 yet, you may have to try htmlentities(utf8_encode($string));
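A short sketch of what that looks like when the character set is passed explicitly. htmlentities() only knows named HTML entities, so for characters without one, mb_encode_numericentity() (from the mbstring extension) can produce numeric references instead. The conversion map below, covering everything from U+0080 upwards, is an assumption for this example.

```php
<?php
$string = 'tchüß';

// Named entities, with the input charset stated explicitly.
$named = htmlentities($string, ENT_QUOTES, 'UTF-8');
echo $named, "\n"; // tch&uuml;&szlig;

// Numeric references for every code point from U+0080 up.
// Map format: [first code point, last code point, offset, mask].
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$numeric = mb_encode_numericentity($string, $map, 'UTF-8');
echo $numeric, "\n"; // tch&#252;&#223;
```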
Solution 5
I once created a function called _convert() which safely encodes everything to UTF-8.
Adrien Hingert
Updated on July 22, 2022
Comments
-
Adrien Hingert almost 2 years
Is there a function that will change UTF-8 to Unicode leaving non special characters as normal letters and numbers?
For example, the German word "tchüß" would be rendered as something like "tch\20AC\21AC" (please note that I am making the Unicode codes up).
EDIT: I am experimenting with the following function. It works well for ASCII 32-127, but it seems to fail for multi-byte characters:
function strToHex($string) {
    $hex = '';
    for ($i = 0; $i < mb_strlen($string, "utf-8"); $i++) {
        $id = ord(mb_substr($string, $i, 1, "utf-8"));
        $hex .= ($id <= 128) ? mb_substr($string, $i, 1, "utf-8") : "&#" . $id . ";";
    }
    return ($hex);
}
Any ideas?
EDIT 2: Found the solution: PHP's ord() function only looks at the first byte of its argument, so it does not work for multi-byte characters. Use this instead: http://nl.php.net/manual/en/function.ord.php#78032
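For readers on a newer PHP, here is a sketch of the same idea using mb_ord(), which returns the code point of a multi-byte character. This is not the code from the linked comment. It assumes PHP 7.4 or later with the mbstring extension (mb_str_split() was added in 7.4, mb_ord() in 7.2), and the name strToEntities() is made up for this example.

```php
<?php
// Hypothetical rewrite of strToHex(): ASCII stays as-is,
// everything else becomes a decimal numeric character reference.
function strToEntities(string $string): string
{
    $out = '';
    // Split into whole characters rather than bytes.
    foreach (mb_str_split($string, 1, 'UTF-8') as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        $out .= ($codePoint < 128) ? $char : '&#' . $codePoint . ';';
    }
    return $out;
}

echo strToEntities('tchüß'), "\n"; // tch&#252;&#223;
```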
-
Amit Patil over 12 years
htmlentities only converts characters for which there are entities defined in the HTML language, though, which only covers a small subset of Unicode. Unfortunately it does not create &#...; character references for other characters.
-
Luwe over 12 years
I'm aware, but also iconv tends to give some problems. Not all characters seem to get perfectly converted for every character set. That's why I mentioned the htmlentities function. It was also suggested in the comments on the iconv function page: nl.php.net/manual/en/function.iconv.php#81494
-
Adrien Hingert over 12 years
Interesting, never thought of this!
-
Anthony about 11 years
Brilliant! Works like a charm.. :)
-
eis almost 7 years
Note that you need extension=php_intl.dll enabled in php.ini for this class to be present.
-
eis almost 7 years
You could add the answer here, and not as a link.
-
William R about 6 years
JSON requires, by default, the escaping of non-ASCII characters. And you should do it every time.
-
Basster almost 6 years
Great solution!
-
Ulrich Eckhardt over 5 years
@WilliamR, why do you think so? JSON is by definition UTF-8, which is fully Unicode-capable. Escaping anything that is Unicode is not necessary.
-
William R over 5 years
Well, it is obvious that JSON should use UTF-8. But escaping Unicode characters as ASCII ("é" becomes \u00e9) is a good way to protect your data against a wrong "charset" set in the headers of an HTTP transmission, against badly programmed code, or even worse, against JSON placed inside a CDATA section in an ISO-Latin1 XML file.