Converting these types of unicode to UTF8 in PHP

php unicode utf-8

21,683

Solution 1

None of the other answers works perfectly as is. Combining them with one addition of my own gives this:

// The /i flag lets the pattern accept upper-case hex digits (\u05EA) as well as lower-case ones
$replacedString = preg_replace("/\\\\u([0-9a-f]{4})/i", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

This one definitely does work :)
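
On PHP 8.2 and later, passing 'HTML-ENTITIES' to mb_convert_encoding() raises a deprecation notice. A minimal sketch of the same two-step idea using html_entity_decode() instead is shown here; the function name decodeUnicodeEscapes is made up for illustration, and the sketch assumes the input contains only four-digit \uXXXX escapes (no surrogate pairs).

```php
<?php
// Hypothetical helper name; not part of the original answer.
function decodeUnicodeEscapes(string $input): string
{
    // Step 1: rewrite each \uXXXX escape as a hexadecimal numeric entity, e.g. &#x05ea;
    $withEntities = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $input);

    // Step 2: decode the entities into UTF-8 characters.
    // Note: this also decodes any other entities already present in the input.
    return html_entity_decode($withEntities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeUnicodeEscapes('Tall \u02bcAb\u012bb'), PHP_EOL;
```

Like the mb_convert_encoding() version, this decodes every entity in the string, not only the ones produced by the first step.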

Solution 2

I encountered the same problem recently, so was glad to see this question. Doing some tests, I found the following code works:

$replacedString = preg_replace("/\\\\u([0-9abcdef]{4})/", "&#x$1;", $original_string);
//$unicodeString    = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

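The only thing I changed is that I commented out the second line of code. The string then contains HTML numeric entities rather than UTF-8 characters, so this only helps when the output is rendered as HTML. The web page, however, must be set to display UTF-8.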

Enjoy!
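
To make the effect of skipping the second step concrete, here is a small sketch of what the string looks like afterwards. The helper name toNumericEntities is an assumption for illustration only.

```php
<?php
// Hypothetical helper; the name is an illustration, not from the answer.
function toNumericEntities(string $input): string
{
    // Rewrite \uXXXX as &#xXXXX; and leave everything else untouched.
    return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $input);
}

$html = toNumericEntities('Hebrew: \u05ea\u05b5');
echo $html, PHP_EOL; // prints: Hebrew: &#x05ea;&#x05b5;
```

The result is still plain ASCII; it is the browser that turns the entities into characters when it renders the page.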

Solution 3

A pattern built on \d{4} doesn't always work, because a \uXXXX code is hexadecimal and can contain letters as well as digits. Try replacing \d (digits only) with \w (which matches letters, digits, and the underscore).

function unicode_conv($originalString) {
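  // The four \\\\ in the pattern here are necessary to match \u in the original string.
  // The "x" in "&#x$1;" marks the entity as hexadecimal. Note that \w also matches
  // non-hex letters and "_", so [0-9a-fA-F] is a stricter choice.
  $replacedString = preg_replace("/\\\\u(\w{4})/", "&#x$1;", $originalString);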
  $unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
  return $unicodeString;
}

Solution 4

You should add 'x' after '#' in the replacement string to indicate that the number is hexadecimal.

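// [0-9a-fA-F] rather than \d, so that codes containing hex letters (such as \u05ea) also match
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);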
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
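
As a quick check of why the 'x' matters, the sketch below decodes the same code point written once as a hexadecimal entity and once as a decimal entity. It uses html_entity_decode() for the decoding step, which is a substitution on my part rather than what the answer uses.

```php
<?php
$hex = '05ea';            // the digits taken from the escape \u05ea
$decimal = hexdec($hex);  // 1514, the same code point in decimal

// "&#x05ea;" (hexadecimal form) and "&#1514;" (decimal form) name the same character.
$fromHex     = html_entity_decode("&#x{$hex};", ENT_QUOTES, 'UTF-8');
$fromDecimal = html_entity_decode("&#{$decimal};", ENT_QUOTES, 'UTF-8');

var_dump($fromHex === $fromDecimal); // bool(true)
echo bin2hex($fromHex), PHP_EOL;     // d7aa, the UTF-8 bytes of U+05EA
```

Without the 'x', the digits "05ea" are read as a decimal number, which they are not, so the entity cannot name the intended character.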

Solution 5

See this comment for a way to get a unicode character from its numerical code. Then, you could write a regex replace that will replace each \uXXXX pattern with the equivalent character.
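
A minimal sketch of that first approach, replacing each escape directly with its character, might look like the following. It assumes PHP 7.2 or later with the mbstring extension (for mb_chr()), and the function name unescapeUnicode is hypothetical.

```php
<?php
// Hypothetical helper name, for illustration only.
function unescapeUnicode(string $input): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        static function (array $m): string {
            // $m[1] holds the four hex digits; hexdec() turns them into a code point.
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // If the code point cannot be encoded, keep the original escape text.
            return $char === false ? $m[0] : $char;
        },
        $input
    );
}

echo unescapeUnicode('Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628'), PHP_EOL;
```

Because no HTML entity decoding is involved, entities that were already in the input (such as &quot;) are left as they are.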

Alternatively, you could replace each \uXXXX pattern with its matching &#xXXXX; HTML entity form (the x marks the number as hexadecimal), and then use the following:

mb_convert_encoding($stringWithHtmlEntities, 'UTF-8', 'HTML-ENTITIES');

More complete example:

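// The four \\\\ in the pattern here are necessary to match \u in the original string.
// [0-9a-fA-F] matches hex digits, and the "x" in "&#x$1;" marks the entity as hexadecimal.
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);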
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

Author by

Simon

Updated on July 09, 2022

Comments

Simon almost 2 years

I am trying to convert this into readable UTF-8 text in PHP:

Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv

Any ideas on how to do so?

I tried several methods I found online, but none of them worked.

In this case I have Unicode escapes for Hebrew and Arabic text.

Simon over 14 years

Could you give me an example? I didn't understand the example in the link. Say I have the string "\u05ea" somewhere in the text. How would I change it to its HTML entity form, given that it's not "&#05ea;" or the first option you mentioned? Thanks for the help.
Amber over 14 years

Sure, I added a more complete example to my answer.
Alix Axel over 14 years

@Dav: Why \\\\u? Isn't \\u enough? I also think that \d{2,4} would make it more complete.
Amber over 14 years

Alix: \u would be interpreted by the regex engine as an escape-code u, sort of like how \d is the set of digits, and \w is the set of "word" characters. Thus you need to actually escape the slash in the regex, which means your regex needs to be \\u, and then you have to escape those slashes since they're within the string, thus you have \\\\ as the escaped form of \\.
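
The two layers of escaping described above can be checked with a short sketch. The subject string here is an invented example.

```php
<?php
// Single-quoted, so PHP keeps the backslash: the string is c, a, f, \, u, 0, 0, e, 9.
$subject = 'caf\u00e9';

// PHP's string parser turns each \\ into one backslash,
// so PCRE receives \\u, which means "a literal backslash followed by u".
$pattern = "/\\\\u([0-9a-f]{4})/";

echo $pattern, PHP_EOL;                          // prints: /\\u([0-9a-f]{4})/
$matched = preg_match($pattern, $subject, $m);
echo $matched, ' ', $m[1], PHP_EOL;              // prints: 1 00e9
```

With only two backslashes in the PHP source, the pattern that reaches the regex engine would be \u, which is the escape code the comment above warns about.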
dzeikei over 12 years

I must mention that using the mb_convert_encoding() method will convert any &quot; in the original string into " because it involves parsing HTML. Beware!
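
One way to sidestep that side effect is to avoid HTML entities altogether and treat the hex digits as UTF-16BE code units. The sketch below is only an illustration under that assumption. The function name decodeEscapesViaUtf16 is hypothetical, and it relies on the mbstring extension.

```php
<?php
// Hypothetical helper name, for illustration only.
function decodeEscapesViaUtf16(string $input): string
{
    // Match a whole run of consecutive escapes so that they are converted together.
    return preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        static function (array $m): string {
            // Drop the "\u" prefixes, leaving only hex digits, e.g. "05ea05b5".
            $hex = str_replace('\\u', '', $m[0]);
            // Pack the hex digits into raw bytes and reinterpret them as UTF-16BE.
            return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
        },
        $input
    );
}

echo decodeEscapesViaUtf16('say &quot;\u05ea\u05b5&quot;'), PHP_EOL;
```

Since only the matched \uXXXX runs are touched, any &quot; already in the input is passed through unchanged.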