Convert Unicode from JSON string with PHP
Solution 1
It is not UTF-16 encoding. It looks like a bogus encoding, because the \uXXXX escape is independent of any UTF or UCS encoding of Unicode. \u00c2\u00a3 really maps to the string Â£.
What you should have is \u00a3, which is the Unicode code point for £.
{0xC2, 0xA3} is the UTF-8 encoded 2-byte character for this code point.
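The byte values above can be checked directly. The following is a minimal sketch, assuming PHP 7 or later (for the \u{...} string escape) and the standard json extension; the variable names are only illustrative.

```php
<?php
// U+00A3 (the pound sign) encoded as UTF-8 is the two bytes 0xC2 0xA3.
$pound = "\u{00A3}";
echo bin2hex($pound), "\n";       // c2a3

// The escapes from the question decode to two separate code points,
// U+00C2 followed by U+00A3, i.e. the four bytes of the string "Â£".
$decoded = json_decode('"\u00c2\u00a3"');
echo bin2hex($decoded), "\n";     // c382c2a3
```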
If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly escaped each byte as a Unicode code point, then you need to convert each pair of escaped code points back into a single UTF-8 encoded character, and then decode it to the encoding you print in (utf8_decode converts to ISO-8859-1) to make it printable.
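function fixBadUnicode($str) {
    // Rebuild each 2-byte UTF-8 character from a pair of \u00XX escapes, then convert it to ISO-8859-1.
    // The /e modifier used here only works on PHP versions before 7.0.
    return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
}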
Example here: http://phpfiddle.org/main/code/6sq-rkn
Edit:
If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:
function fixBadUnicodeForJson($str) {
    // Longest runs first: 4-byte, then 3-byte, then 2-byte sequences, then single bytes.
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
    return $str;
}
Edit 2: fixed the previous function so that it transforms any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.
Be careful: some of these characters, which probably come from an editor such as Word, cannot be represented in ISO-8859-1 and will therefore appear as '?' after utf8_decode.
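If the end goal is HTML output, one way to sidestep the '?' problem is to skip utf8_decode, keep the repaired string in UTF-8, and let htmlentities produce named entities. This is a minimal sketch, assuming PHP 7 or later; the sample string stands in for an already repaired title and is not taken from the API.

```php
<?php
// A repaired UTF-8 string: right single quotation mark (U+2019) and pound sign (U+00A3).
$title = "That\u{2019}s \u{00A3}5";

// Passing the charset explicitly makes htmlentities treat the input as UTF-8.
$html = htmlentities($title, ENT_QUOTES, 'UTF-8');
echo $html, "\n";   // That&rsquo;s &pound;5
```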
Solution 2
The output is correct.
\u00c2 == Â
\u00a3 == £
So nothing is wrong here. And converting to HTML entities is easy:
htmlentities($title);
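For comparison, this is roughly what happens when htmlentities is applied to the decoded string without repairing it first. It is a small sketch, assuming PHP 7 or later and assuming $title holds the decoded sample from the question.

```php
<?php
// Decode the mis-escaped sample: this yields the two characters "Â£".
$title = json_decode('"\u00c2\u00a3"');

// Each character gets its own entity, so the stray "Â" is still there.
$html = htmlentities($title, ENT_QUOTES, 'UTF-8');
echo $html, "\n";   // &Acirc;&pound;
```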
Solution 3
Here is an updated version of the function using preg_replace_callback instead of preg_replace.
function fixBadUnicodeForJson($str) {
    // Longest runs first: 4-byte, then 3-byte, then 2-byte sequences, then single bytes.
    // The captured hex digits are read from $matches; "$1" is not substituted inside a callback.
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])); },
        $str
    );
    return $str;
}
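The four passes above are meant to turn every \u00XX escape into a raw byte, including escapes that are not part of a UTF-8 sequence (a lone \u00e9, for example). A stricter variant could convert only runs that spell out a valid multi-byte sequence and leave everything else untouched. The following is a sketch under that assumption, not the answer's own method: fixMisescapedUtf8 is a hypothetical name, the sample JSON is made up, and mb_check_encoding requires the mbstring extension.

```php
<?php
/**
 * Hypothetical helper: replace runs of \u00XX escapes that form a valid
 * multi-byte UTF-8 sequence with the bytes they were meant to represent.
 */
function fixMisescapedUtf8(string $json): string
{
    // One escaped continuation byte (0x80-0xBF).
    $cont = '\\\\u00[89abAB][0-9a-fA-F]';
    // Lead byte followed by the right number of continuation bytes.
    $pattern = '/\\\\u00[fF][0-7](?:' . $cont . '){3}'        // 4-byte sequence
             . '|\\\\u00[eE][0-9a-fA-F](?:' . $cont . '){2}'  // 3-byte sequence
             . '|\\\\u00[cdCD][0-9a-fA-F]' . $cont . '/';     // 2-byte sequence

    return preg_replace_callback($pattern, function (array $m): string {
        // Drop the "\u00" prefixes so only the hex digits of each byte remain.
        $bytes = hex2bin(str_replace('\u00', '', $m[0]));
        // Keep the original escapes if the bytes are not valid UTF-8.
        return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $m[0];
    }, $json);
}

// Made-up sample: two mis-escaped characters and one correct escape (\u00e9).
$raw = '{"title":"That\u00e2\u0080\u0099s \u00c2\u00a35, caf\u00e9"}';
$data = json_decode(fixMisescapedUtf8($raw), true);
echo $data['title'], "\n";   // That’s £5, café
```

Because only multi-byte sequences are rewritten, the bytes inserted into the JSON text are always characters above U+007F, so no quotes or control characters are introduced before json_decode runs.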
Comments
-
Alexander Holsgrove almost 2 years
I've been reading up on a few solutions but have not managed to get anything to work yet.
I have a JSON string that I read in from an API call and it contains Unicode characters; \u00c2\u00a3, for example, is the £ symbol. I'd like to use PHP to convert these into either £ or &pound;.
I'm looking into the problem and found the following code (using my pound symbol to test), but it didn't seem to work:
$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');
The output is Â£.
Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?
UPDATE
It seems that the JSON string from the API has 2 or 3 unescaped Unicode strings, e.g.:
That\u00e2\u0080\u0099s (right single quotation) \u00c2\u00a (pound symbol)
-
Alexander Holsgrove about 11 years
The first part is correct, but htmlentities($title) gives me �£
-
SirDarius about 11 years
The output is correct, but it is obvious that the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point.
-
Alexander Holsgrove about 11 years
Thanks for this. Can I run that on the entire string before or after calling json_decode, to save calling fixBadUnicode multiple times?
-
SirDarius about 11 years
You can run it before json_decode; however, be careful, as this might lead your JSON string to contain illegal characters. See json.org for the list of characters that can exist in JSON strings.
-
Alexander Holsgrove about 11 years
Just for reference, the JSON is from the Hot UK Deals API. I didn't want to mess about with the default XML feed type.
-
Alexander Holsgrove about 11 years
If I run it on the raw JSON, it converts the '\u00c2\u00a3' to '�'. I also found \u0099 is left unchanged - I think this is an apostrophe. Seems like a really poor JSON data feed!
-
Alexander Holsgrove about 11 years
That's great - thank you. I don't need the encoded JSON after it has been 'fixed' as I need to iterate through the data. Can I instead call json_decode and then preg_replace(...) without needing to call json_encode and the substr?
-
SirDarius about 11 years
@AlexHolsgrove I'm afraid not. fixBadUnicodeForJson will have to be called first on the raw JSON data, then use json_decode on the result, and you're good.
-
Alexander Holsgrove about 11 years
It seems to find more invalid UTF-8 data. I set up a demo here (where you can also see the raw JSON): phpfiddle.org/main/code/rfk-50n
-
Alexander Holsgrove about 11 years
Do I need to run the 'fix' twice? I can't see how to get it to decode the JSON, as it won't return the array.
-
SirDarius about 11 years
You need to take into account UTF-8 characters with more than two bytes... see my edit :)
-
Hossein Jabbari over 8 years
The preg_replace "e" modifier is deprecated. Can you write this using preg_replace_callback?