Convert Unicode from JSON string with PHP

Solution 1

It is not UTF-16 encoding. It looks like a bogus encoding, because the \uXXXX escape is independent of any UTF or UCS encoding of Unicode: it names a code point, not bytes. \u00c2\u00a3 really maps to the two-character string Â£.

What you should have is \u00a3, which is the Unicode code point for £.

{0xC2, 0xA3} is the UTF-8 encoded 2-byte character for this code point.

If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of escaped code points back into one UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.

function fixBadUnicode($str) {
    return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
}

Example here: http://phpfiddle.org/main/code/6sq-rkn
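To make the difference between the two escapes concrete, here is a minimal sketch that decodes both forms with json_decode and prints the resulting bytes. The variable names are illustrative only and do not come from the original answer.

```php
<?php
// Correct escape: the single code point U+00A3 decodes to the UTF-8 bytes C2 A3.
$good = json_decode('"\u00a3"');
echo bin2hex($good), "\n"; // c2a3

// Byte-wise escape: each UTF-8 byte was escaped as its own code point,
// so decoding yields U+00C2 followed by U+00A3, i.e. "Â£" (bytes C3 82 C2 A3).
$bad = json_decode('"\u00c2\u00a3"');
echo bin2hex($bad), "\n"; // c382c2a3
```

The second result is why the decoded text shows an extra Â in front of the pound sign.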

Edit:

If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:

function fixBadUnicodeForJson($str) {
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
    return $str;
}

Edit 2: fixed the previous function so that it transforms any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.

Be careful: some of these characters, which probably come from an editor such as Word, are not translatable to ISO-8859-1 and will therefore appear as '?' after utf8_decode.

Solution 2

The output is correct.

\u00c2 == Â
\u00a3 == £

So nothing is wrong here. And converting to HTML entities is easy:

htmlentities($title);
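As a hedged illustration of what htmlentities returns for each form, the sketch below passes the charset explicitly. The two sample strings are assumptions standing in for $title: one is the intended pound sign, the other is the decoded \u00c2\u00a3 pair.

```php
<?php
$intended = "\u{00A3}";          // "£", what the feed should have contained
$decoded  = "\u{00C2}\u{00A3}";  // "Â£", what \u00c2\u00a3 actually decodes to

// Passing 'UTF-8' explicitly avoids depending on the default charset,
// which was ISO-8859-1 before PHP 5.4.
echo htmlentities($intended, ENT_QUOTES, 'UTF-8'), "\n"; // &pound;
echo htmlentities($decoded, ENT_QUOTES, 'UTF-8'), "\n";  // &Acirc;&pound;
```

So the entity conversion itself is easy, but the stray Â survives it unless the byte-wise escapes are repaired first.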

Solution 3

Here is an updated version of the function using preg_replace_callback instead of preg_replace.

function fixBadUnicodeForJson($str) {
    // Longest sequences first: four escaped bytes, then three, two, and finally one.
    // Inside a callback the captured groups are in $matches[n], not "$n".
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])); },
        $str
    );
    return $str;
}
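One of the comments warns that unescaping everything can leave illegal characters in the JSON, for example when \u0022 becomes a raw double quote. The following is a rough single-pass sketch of a more conservative variant, not the method from the answers above: the helper name fixByteWiseEscapes and the sample feed string are hypothetical, and it assumes the mbstring extension is available. It only rewrites runs of escapes in the 0x80 to 0xFF range, and only when the resulting bytes are valid UTF-8. It does not handle an escaped backslash directly in front of a "u00.." sequence.

```php
<?php
// Hypothetical helper, shown as a sketch under the assumptions stated above.
function fixByteWiseEscapes(string $json): string
{
    return preg_replace_callback(
        // A run of \u0080..\u00ff escapes; ASCII-range escapes such as \u0022 never match.
        '/(?:\\\\u00[89a-f][0-9a-f])+/i',
        function (array $m): string {
            // Drop every "\u00" prefix and pack the remaining hex digits into raw bytes.
            $bytes = hex2bin(str_ireplace('\u00', '', $m[0]));
            // Substitute only when the bytes form valid UTF-8; otherwise keep the escapes.
            return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $m[0];
        },
        $json
    );
}

// Made-up sample mixing byte-wise escapes, a genuine \u00e9, and escaped quotes.
$raw = '{"title":"That\u00e2\u0080\u0099s \u00c2\u00a35, caf\u00e9 \u0022ok\u0022"}';
$data = json_decode(fixByteWiseEscapes($raw), true);
echo $data['title'], "\n";
```

As in the comments, the fix runs on the raw JSON first and json_decode is called on the result. A lone \u00e9 is left alone because the single byte E9 is not valid UTF-8 on its own, so legitimate Latin-1 range escapes are still decoded normally by json_decode.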
Author by

Alexander Holsgrove

Lead WordPress developer at Infotex

Updated on July 05, 2022

Comments

  • Alexander Holsgrove almost 2 years

    I've been reading up on a few solutions but have not managed to get anything to work as yet.

    I have a JSON string that I read in from an API call and it contains Unicode characters - \u00c2\u00a3 for example is the £ symbol.

    I'd like to use PHP to convert these into either £ or &pound;.

    I'm looking into the problem and found the following code (using my pound symbol to test) but it didn't seem to work:

    $title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');
    

    The output is Â£.

    Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?

    UPDATE

    It seems that the JSON string from the API has 2 or 3 unescaped Unicode strings, e.g.:

    That\u00e2\u0080\u0099s (right single quotation)
    \u00c2\u00a (pound symbol)
    
  • Alexander Holsgrove about 11 years
    The first part is correct, but htmlentities($title) gives me �£
  • SirDarius about 11 years
    The output is correct, but it is obvious that the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point.
  • Alexander Holsgrove about 11 years
    Thanks for this. Can I run that on the entire string before or after calling json_decode, to save calling 'fixBadUnicode' multiple times?
  • SirDarius about 11 years
    You can run it before json_decode, but be careful: this might leave your JSON string containing illegal characters. See json.org for the list of characters that can exist in JSON strings.
  • Alexander Holsgrove about 11 years
    Just for reference, the JSON is from the Hot UK Deals API. I didn't want to mess about with the default XML feed type
  • Alexander Holsgrove about 11 years
    If I run it on the raw JSON, it converts the '\u00c2\u00a3' to '�'. I also found \u0099 is left unchanged - I think this is an apostrophe. Seems like a really poor JSON data feed!
  • Alexander Holsgrove about 11 years
    That's great - thank you. I don't need the encoded JSON after it has been 'fixed' as I need to iterate through the data. Can I instead call json_decode and then preg_replace(...) without needing to call json_encode and the substr?
  • SirDarius about 11 years
    @AlexHolsgrove I'm afraid not. fixBadUnicodeForJson has to be called first on the raw JSON data; then use json_decode on the result, and you're good.
  • Alexander Holsgrove about 11 years
    It seems to find more invalid UTF-8 data. I set up a demo here (where you can also see the raw JSON): phpfiddle.org/main/code/rfk-50n
  • Alexander Holsgrove about 11 years
    Do I need to run the 'fix' twice? I can't see how to get it to decode the json as it won't return the array.
  • SirDarius about 11 years
    You need to take into account UTF-8 characters with more than two bytes. See my edit :)
  • Hossein Jabbari over 8 years
    The preg_replace "e" modifier is deprecated. Can you write this using preg_replace_callback?