UTF-8 to Unicode Code Points

Solution 1

Converting one character set to another can be done with iconv:

http://php.net/manual/en/function.iconv.php

Note that UTF-8 is already a Unicode encoding.
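As a minimal sketch of the iconv route: converting to UTF-32BE stores every code point as one 4-byte big-endian integer, which unpack() can then read directly. The helper name utf8ToCodePoints() is hypothetical, and the sketch assumes the iconv extension is available and the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper (not a PHP built-in): list the code points of a UTF-8 string.
function utf8ToCodePoints(string $utf8): array
{
    // UTF-32BE encodes each code point as one 4-byte big-endian integer.
    $utf32 = iconv('UTF-8', 'UTF-32BE', $utf8);
    if ($utf32 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    // 'N*' reads unsigned 32-bit big-endian integers; unpack() keys start at 1,
    // so array_values() re-indexes the result from 0.
    return array_values(unpack('N*', $utf32));
}

foreach (utf8ToCodePoints('tchüß') as $codePoint) {
    printf("U+%04X\n", $codePoint);
}
```

For "tchüß" this prints U+0074, U+0063, U+0068, U+00FC and U+00DF, one per line.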

Another way is simply using htmlentities with the right character set:

http://php.net/manual/en/function.htmlentities.php
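A short sketch of the htmlentities approach, assuming the input is UTF-8 and the default HTML 4.01 entity table is used. It also illustrates the limitation that only characters with a named entity are converted; the Cyrillic sample character is just an illustration.

```php
<?php
// Passing the encoding explicitly avoids relying on the default_charset setting.
$german = htmlentities('tchüß', ENT_QUOTES, 'UTF-8');
echo $german, "\n"; // tch&uuml;&szlig;

// Characters without a named HTML entity are passed through unchanged.
$cyrillic = htmlentities('ж', ENT_QUOTES, 'UTF-8');
echo $cyrillic, "\n"; // ж
```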

Solution 2

For a readable form I would go with JSON. JSON does not require non-ASCII characters to be escaped, but PHP's json_encode() escapes them by default:

echo json_encode("tchüß");

"tch\u00fc\u00df"
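To make the default behaviour and its opt-out concrete, here is a small sketch contrasting the two. The variable names are illustrative only. Keep in mind that json_encode() also escapes quotes, backslashes and (by default) forward slashes, and that characters outside the Basic Multilingual Plane are written as two \uXXXX surrogate escapes.

```php
<?php
$word = 'tchüß';

// Default: every non-ASCII character becomes a \uXXXX escape.
$escaped = json_encode($word);
echo $escaped, "\n"; // "tch\u00fc\u00df"

// JSON_UNESCAPED_UNICODE keeps the characters as literal UTF-8.
$literal = json_encode($word, JSON_UNESCAPED_UNICODE);
echo $literal, "\n"; // "tchüß"
```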

Solution 3

With PHP 7, there is a new IntlChar::ord() to find the Unicode Code Point from a given UTF-8 character:

var_dump(sprintf('U+%04X', IntlChar::ord('ß')));

# Outputs: string(6) "U+00DF"
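Applied to a whole string, the same call could be used roughly as follows. This is a sketch, not a definitive implementation: escapeNonAscii() is a hypothetical helper name, the \u{XXXX} output format is an arbitrary choice, and the intl extension must be loaded for IntlChar to exist.

```php
<?php
// Hypothetical helper: keep ASCII as-is, replace everything else with \u{XXXX}.
// Requires PHP 7+ with the intl extension (for IntlChar).
function escapeNonAscii(string $utf8): string
{
    // The /u modifier makes PCRE split on whole UTF-8 characters, not bytes.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $out = '';
    foreach ($chars as $char) {
        $codePoint = IntlChar::ord($char);
        $out .= $codePoint < 0x80 ? $char : sprintf('\\u{%04X}', $codePoint);
    }
    return $out;
}

// Usage: echo escapeNonAscii('tchüß'); // tch\u{00FC}\u{00DF}
```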

Solution 4

I guess you're going to print out your strings on a website?

I store all my databases in UTF-8 and call htmlentities($string) before output.

If your strings are ISO-8859-1 rather than UTF-8, you may have to try htmlentities(utf8_encode($string));

Solution 5

I once created a function called _convert() which safely encodes everything to UTF-8.

Author: Adrien Hingert

Updated on July 22, 2022

Comments

  • Adrien Hingert
    Adrien Hingert almost 2 years

    Is there a function that converts UTF-8 to Unicode code points while leaving non-special characters as normal letters and numbers?

    i.e. the German word "tchüß" would be rendered as something like "tch\20AC\21AC" (please note that I am making the Unicode codes up).

    EDIT: I am experimenting with the following function. It works well for ASCII 32-127, but it seems to fail for multibyte characters:

    function strToHex ($string)
    {
        $hex = '';
        for ($i = 0; $i < mb_strlen ($string, "utf-8"); $i++)
        {
            $id = ord (mb_substr ($string, $i, 1, "utf-8"));
            $hex .= ($id <= 128) ? mb_substr ($string, $i, 1, "utf-8") : "&#" . $id . ";";
        }

        return ($hex);
    }

    Any ideas?

    EDIT 2: Found the solution: the PHP ord() function only looks at the first byte of its argument, so it does not work for multibyte characters. Use this instead: http://nl.php.net/manual/en/function.ord.php#78032
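For illustration, a multibyte-aware replacement for ord() could look roughly like the sketch below. This is not the code from the linked manual comment: utf8Ord() and strToEntities() are hypothetical names, and the decoder assumes well-formed UTF-8 and performs no validation.

```php
<?php
// Hypothetical helper: code point of the first UTF-8 character in $char.
// Assumes well-formed, non-empty input; no validation is performed.
function utf8Ord(string $char): int
{
    $b0 = ord($char[0]);
    if ($b0 < 0x80) { // 0xxxxxxx: single byte (ASCII)
        return $b0;
    }
    if ($b0 < 0xE0) { // 110xxxxx: two-byte sequence
        return (($b0 & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    }
    if ($b0 < 0xF0) { // 1110xxxx: three-byte sequence
        return (($b0 & 0x0F) << 12)
             | ((ord($char[1]) & 0x3F) << 6)
             | (ord($char[2]) & 0x3F);
    }
    // 11110xxx: four-byte sequence
    return (($b0 & 0x07) << 18)
         | ((ord($char[1]) & 0x3F) << 12)
         | ((ord($char[2]) & 0x3F) << 6)
         | (ord($char[3]) & 0x3F);
}

// Hypothetical variant of strToHex(): ASCII stays as-is,
// everything else becomes a decimal numeric character reference.
function strToEntities(string $string): string
{
    $out = '';
    // The /u modifier splits on whole UTF-8 characters; ?: [] guards against false.
    foreach (preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $char) {
        $id = utf8Ord($char);
        $out .= $id < 128 ? $char : '&#' . $id . ';';
    }
    return $out;
}

echo strToEntities('tchüß'), "\n"; // tch&#252;&#223;
```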

  • Amit Patil
    Amit Patil over 12 years
    htmlentities only converts characters for which there are entities defined in the HTML language, though, which only covers a small subset of Unicode. Unfortunately it does not create &#...; character references for other characters.
  • Luwe
    Luwe over 12 years
    I'm aware, but also iconv tends to give some problems. Not all characters seem to get perfectly converted for every character set. That's why I mentioned the htmlentities function. It was also suggested in the comments on the iconv function page: nl.php.net/manual/en/function.iconv.php#81494
  • Adrien Hingert
    Adrien Hingert over 12 years
    Interesting, never thought of this!
  • Anthony
    Anthony about 11 years
    Brilliant! Works like a charm.. :)
  • eis
    eis almost 7 years
    Note that you need extension=php_intl.dll enabled in php.ini for this class to be present.
  • eis
    eis almost 7 years
    you could add the answer here, and not as a link.
  • William R
    William R about 6 years
    JSON requires, by default, the escaping of non-ASCII characters. And you should do it every time.
  • Basster
    Basster almost 6 years
    Great solution!
  • Ulrich Eckhardt
    Ulrich Eckhardt over 5 years
    @WilliamR, why do you think so? JSON is by definition UTF-8, which is fully Unicode-capable. Escaping anything that is Unicode is not necessary.
  • William R
    William R over 5 years
    Well, it is obvious that JSON should use UTF-8. But escaping Unicode characters as ASCII ("é" becomes \u00e9) is a good way to protect your data against a wrong "charset" set in the headers of an HTTP transmission, against badly programmed code or, even worse, against JSON inside a CDATA section in an ISO-Latin1 XML file.