Converting these types of unicode to UTF8 in PHP
Solution 1
None of the other answers works perfectly as written. Combining them with one addition of my own (matching hex letters as well as digits) gives this:
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
This one definitely does work :)
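As a hedged illustration, the two lines above can be wrapped in a helper and run against part of the question's sample. The function name `decodeUnicodeEscapes` is hypothetical, the pattern is case-insensitive so `\u05EA` also matches, and `html_entity_decode` stands in for `mb_convert_encoding` because the `HTML-ENTITIES` pseudo-encoding is deprecated as of PHP 8.2. Treat it as a sketch under those assumptions, not as the answer's exact method.

```php
<?php
// Hypothetical helper, not part of the original answer.
function decodeUnicodeEscapes(string $text): string
{
    // Turn each \uXXXX escape into a hexadecimal numeric entity (&#xXXXX;).
    $entities = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $text);

    // Decode the entities to UTF-8. This also decodes any other HTML
    // entities (such as &amp;) that were already present in $text.
    return html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$sample = 'Tel Aviv-Yafo (Hebrew: \u05ea\u05dc; Arabic: \u062a\u0644)';
echo decodeUnicodeEscapes($sample), PHP_EOL;
```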
Solution 2
I encountered the same problem recently, so was glad to see this question. Doing some tests, I found the following code works:
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $original_string);
//$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
The only thing I changed is that I commented out the second line of code, so the result stays as numeric HTML entities instead of being converted to UTF-8. The web page, however, must be set to display UTF-8.
Enjoy!
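To make the effect concrete, here is a minimal sketch of what the single `preg_replace` call produces. The shortened sample string is an assumption, taken from the question. The result is plain ASCII containing numeric character references, which a browser renders as the intended characters; it is not UTF-8 encoded text.

```php
<?php
$original_string = 'Hebrew: \u05ea\u05dc';

// Only the first step: \uXXXX becomes &#xXXXX;
$replacedString = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $original_string);

echo $replacedString, PHP_EOL; // Hebrew: &#x05ea;&#x05dc;
```

Because the output is markup, it is only suitable where HTML is interpreted. Written to a text file or a database column, the entities would be stored literally.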
Solution 3
It doesn't always work, because a \uXXXX code can contain both digits and letters. Try replacing \d (digits only) with \w (which matches letters, digits and the underscore).
function unicode_conv($originalString) {
    // The four \\\\ in the pattern here are necessary to match \u in the original string.
    // The "x" in "&#x" marks the captured code as hexadecimal.
    $replacedString = preg_replace("/\\\\u(\w{4})/", "&#x$1;", $originalString);
    $unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
    return $unicodeString;
}
Solution 4
You should add 'x' after '#' in the replacement string to indicate that the number is hexadecimal. The pattern also has to accept hex letters, not just digits:
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
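A small sketch of why the `x` matters, using `html_entity_decode` for the demonstration (an assumption; the answer itself uses `mb_convert_encoding`). For an escape made only of digits, such as `\u0041`, omitting the `x` does not fail. It silently yields a different character, because the digits are read as decimal.

```php
<?php
// \u0041 is "A": hexadecimal 41 equals decimal 65.
$withX    = html_entity_decode('&#x0041;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Without the x the same digits are read as decimal 41, which is ")".
$withoutX = html_entity_decode('&#0041;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($withX);    // string(1) "A"
var_dump($withoutX); // string(1) ")"
```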
Solution 5
See this comment for a way to get a Unicode character from its numerical code. Then, you could write a regex replace that will replace each \uXXXX pattern with the equivalent character.
Alternatively, you could replace each \uXXXX pattern with its matching &#xXXXX; HTML entity form (the x marks the code as hexadecimal), and then use the following:
mb_convert_encoding($stringWithHtmlEntities, 'UTF-8', 'HTML-ENTITIES');
More complete example:
// The four \\\\ in the pattern here are necessary to match \u in the original string.
// The code is hexadecimal, so the pattern accepts a-f and the entity uses the &#x prefix.
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
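The first approach mentioned above, replacing each escape directly with its character, might look like the following sketch. It assumes PHP 7.2 or later for `mb_chr`, and the function name `unescapeUnicode` is hypothetical. It handles only four-digit escapes and does not combine UTF-16 surrogate pairs.

```php
<?php
// Hypothetical helper illustrating direct replacement, without HTML entities.
function unescapeUnicode(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        static function (array $m): string {
            // hexdec() turns "05ea" into 1514; mb_chr() encodes that code point as UTF-8.
            $char = mb_chr(hexdec($m[1]), 'UTF-8');

            // Leave the escape untouched if the code point cannot be encoded,
            // for example a lone surrogate such as \ud83d.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

echo unescapeUnicode('Tall \u02bcAb\u012bb &quot;Tel Aviv&quot;'), PHP_EOL;
```

Because no HTML entity decoding is involved, entities already present in the input, such as `&quot;`, are left untouched.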
Simon
Updated on July 09, 2022
Comments
-
Simon almost 2 years
I am trying to convert this into readable UTF-8 text in PHP:
Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv
Any ideas on how to do so?
Tried several methods online, but couldn't find one.
In this case I have unicode in Hebrew and Arabic
-
Simon over 14 years
Could you give me an example? I didn't understand the example in the link. Say I have this string "\u05ea" somewhere in the text - how would I change it to its HTML entity form, as it's not "ea;" or the first option you mentioned? Thanks for the help.
-
Amber over 14 years
Sure, I added a more complete example to my answer.
-
Alix Axel over 14 years
@Dav: Why \\\\u? Isn't \\u enough? I also think that \d{2,4} would make it more complete.
-
Amber over 14 years
Alix: \u would be interpreted by the regex engine as an escape-code u, sort of like how \d is the set of digits, and \w is the set of "word" characters. Thus you need to actually escape the slash in the regex, which means your regex needs to be \\u, and then you have to escape those slashes since they're within the string, thus you have \\\\ as the escaped form of \\.
-
dzeikei over 12 years
I must mention that using the mb_convert_encoding() method will convert any &quot; in the original string into " because it involves parsing HTML! Beware.
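The escaping levels discussed in the comments can be checked directly. This is a minimal sketch with illustrative variable names. It shows that the four backslashes in the PHP source reach the regex engine as two, which in turn match one literal backslash in the subject.

```php
<?php
// What the regex engine receives: backslash, backslash, u (3 characters).
$fragment = "\\\\u";
var_dump(strlen($fragment)); // int(3)

// Single-quoted literals unescape \\ the same way, so both spellings are identical.
var_dump($fragment === '\\\\u'); // bool(true)

// The subject contains ONE literal backslash followed by u05ea.
$subject = '\u05ea';
var_dump(strlen($subject)); // int(6)
var_dump(preg_match('/\\\\u/', $subject)); // int(1)
```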