Converting these types of unicode to UTF8 in PHP

Solution 1

None of the other answers works perfectly as is. Combining them, with one addition of my own, gives this:

// [0-9a-fA-F] also accepts uppercase hex digits such as \u05EA
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

This one definitely does work :)
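
For reference, the two steps above can be wrapped in a function. This is a minimal sketch, not the answer's original code: the function name decodeUnicodeEscapes is made up for illustration, and html_entity_decode() is swapped in for mb_convert_encoding() because the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2. Like the original, it also decodes any entities already present in the input.

```php
<?php
// Hypothetical helper: turns literal \uXXXX sequences into UTF-8 text.
function decodeUnicodeEscapes(string $text): string
{
    // Step 1: rewrite each \uXXXX as the hexadecimal entity &#xXXXX;.
    $withEntities = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $text);

    // Step 2: decode the entities into UTF-8 characters.
    return html_entity_decode($withEntities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeUnicodeEscapes('Arabic: \u062a\u0644'), "\n";
```

The input is written in single quotes so that PHP keeps \u062a as a literal backslash sequence, which is how such text usually arrives from an external source.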

Solution 2

I encountered the same problem recently, so was glad to see this question. Doing some tests, I found the following code works:

$replacedString = preg_replace("/\\\\u([0-9abcdef]{4})/", "&#x$1;", $original_string);
//$unicodeString    = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES'); 

The only thing I changed is that I commented out the second line of code. The web page, however, must be set to display UTF-8.
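
A sketch of what this looks like in a page, assuming the output is served as HTML; the sample string and the surrounding markup are illustrative. With the second line left out, the string holds numeric entities, and the browser turns them into characters when it renders the page.

```php
<?php
$originalString = 'Hebrew: \u05ea\u05dc';

// Only the regex step: \uXXXX becomes &#xXXXX; and nothing is decoded in PHP.
$replacedString = preg_replace("/\\\\u([0-9abcdef]{4})/", "&#x$1;", $originalString);

// Declare the charset, as the answer suggests, then emit the entities as HTML.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
echo '<p>', $replacedString, '</p>', "\n";
```

Because the result is markup, it only suits HTML output. Written to a file or a database it would stay as literal &#x05ea; text.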

Enjoy!

Solution 3

It doesn't always work, because a \uXXXX code can contain letters as well as digits. Try replacing \d (digits only) with \w (which matches letters, digits and the underscore).

function unicode_conv($originalString) {
  // The four \\\\ in the pattern here are necessary to match \u in the original string
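  // Caveat: the captured digits are hexadecimal, so "&#$1;" (a decimal entity) still needs the x prefix described in Solution 4
  $replacedString = preg_replace("/\\\\u(\w{4})/", "&#$1;", $originalString);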
  $unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
  return $unicodeString;
}
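
The backslash count mentioned in that comment can be checked directly. A small sketch with illustrative variable names: PHP's string parser reduces the four backslashes to two, and the regex engine then reads those two as one literal backslash.

```php
<?php
$pattern = "/\\\\u(\w{4})/";

// After PHP string parsing the pattern holds two backslashes: /\\u(\w{4})/
echo $pattern, "\n";

// The regex engine reads \\ as one literal backslash, so \u05ea matches.
$subject = 'x \u05ea y';
var_dump(preg_match($pattern, $subject, $matches)); // int(1)
echo $matches[1], "\n"; // 05ea
```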

Solution 4

You should add 'x' after '#' in the replacement string to indicate that the number is hexadecimal.

// Caveat: \d{4} only matches escapes made up entirely of digits, e.g. \u0623 but not \u05ea
$replacedString = preg_replace("/\\\\u(\d{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

Solution 5

See this comment for a way to get a unicode character from its numerical code. Then, you could write a regex replace that will replace each \uXXXX pattern with the equivalent character.
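
A minimal sketch of that first approach, assuming PHP 7.2 or later for mb_chr(). The function name unescapeUnicode is made up here and is not taken from the linked comment. It does not combine surrogate pairs (two \uXXXX escapes that together encode one character outside the Basic Multilingual Plane); such escapes are left unchanged.

```php
<?php
// Hypothetical helper: replaces each \uXXXX with the character it names.
function unescapeUnicode(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        function (array $m) {
            // hexdec() turns the four hex digits into a code point,
            // mb_chr() encodes that code point as UTF-8.
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // mb_chr() returns false for invalid code points (e.g. lone
            // surrogates); in that case keep the escape as it was.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

echo unescapeUnicode('Tall \u02bcAb\u012bb &quot;'), "\n";
```

Because no HTML parsing is involved, entities already present in the input (for example &quot;) are left as they are.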

Alternatively, you could replace each \uXXXX pattern with its matching &#XXXX; html entity form, and then use the following:

$unicodeString = mb_convert_encoding($stringWithHtmlEntities, 'UTF-8', 'HTML-ENTITIES');

More complete example:

// The four \\\\ in the pattern here are necessary to match \u in the original string
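// Caveat: as written this matches all-digit escapes only and emits a decimal entity; Solutions 3 and 4 address both points
$replacedString = preg_replace("/\\\\u(\d{4})/", "&#$1;", $originalString);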
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
Author by

Simon

Updated on July 09, 2022

Comments

  • Simon
    Simon almost 2 years

    I am trying to convert this in to readable UTF8 text in PHP

    Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv
    

    Any ideas on how to do so?

    Tried several methods online, but couldn't find one.

    In this case I have unicode in Hebrew and Arabic

  • Simon
    Simon over 14 years
    Could you give me an example? I didn't understand the example in the link. Say I have this string "\u05ea" somewhere in the text - how would I change it to its HTML entity form, as it's not "&#05ea;" or the first option you mentioned? Thanks for the help.
  • Amber
    Amber over 14 years
    Sure, I added a more complete example to my answer.
  • Alix Axel
    Alix Axel over 14 years
    @Dav: Why \\\\u? Isn't \\u enough? I also think that \d{2,4} would make it more complete.
  • Amber
    Amber over 14 years
    Alix: \u would be interpreted by the regex engine as an escape-code u, sort of like how \d is the set of digits, and \w is the set of "word" characters. Thus you need to actually escape the slash in the regex, which means your regex needs to be \\u, and then you have to escape those slashes since they're within the string, thus you have \\\\ as the escaped form of \\.
  • dzeikei
    dzeikei over 12 years
    I must mention that using the mb_convert_encoding() method will convert any &quot; in the original string into " because it involves parsing HTML!!! Beware.