How to convert string with html encoding to Unicode in java

Solution 1

In a Java string literal, you write a Unicode character as \u followed by its four-digit hexadecimal code.

For example:

System.out.println("\u0042");
System.out.println("\u00AF\\_(\u30C4)_/\u00AF");

Prints:

B
¯\_(ツ)_/¯

What you want is:

System.out.println("\u00D0\u1ED9t nhi\u00EAn, \u1EDF g\u1ED1c T\u00E2y B\u1EAFc v\u0103ng v\u1EB3ng c\u00F3 ti\u1EBFng v\u00F3 ng\u1EF1a d\u1ED3n d\u1EADp.\n");

Prints:

Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.
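The escapes above use hexadecimal, while decimal HTML references such as &#7897; carry the same code point in base 10. The following is a minimal sketch of that relationship. The class name UnicodeEscapeDemo and the helper toJavaEscape are made up for illustration, and the helper only covers code points up to U+FFFF.

```java
class UnicodeEscapeDemo {
    // Builds the escape text a Java literal would use for a code point
    // in the Basic Multilingual Plane (hypothetical helper, not a JDK API).
    static String toJavaEscape(int codePoint) {
        if (codePoint < 0 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("only code points up to 0xFFFF are handled");
        }
        // The doubled backslash yields a literal backslash followed by 'u'.
        return String.format("\\u%04X", codePoint);
    }

    public static void main(String[] args) {
        int decimal = 7897; // the number inside the decimal reference &#7897;
        System.out.println(toJavaEscape(decimal));
        // The same character, written once from the decimal value and once as an escape.
        System.out.println((char) decimal == '\u1ED9');
    }
}
```

This only helps when the text is typed into source code by hand. A string that arrives at runtime still has to be decoded.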

EDIT: Apache Commons is the best way to go:

StringEscapeUtils.unescapeHtml4(string);

Solution 2

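Use Apache Commons StringEscapeUtils.unescapeHtml(string) for this. That is the Commons Lang 2.x name; in Commons Lang 3 and Commons Text the method is called unescapeHtml4.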

Refer: Java: How to unescape HTML character entities in Java?
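If adding a library is not an option, the decoding can be approximated with the standard library alone. The sketch below is not how Apache Commons implements it. The class HtmlEntityDecoder and its methods are hypothetical names, and the named-entity table is deliberately tiny (just enough for the sample text), whereas HTML defines many more names. Decimal and hexadecimal numeric references are handled in general.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlEntityDecoder {
    // Matches named (&name;), decimal (&#1234;) and hexadecimal (&#x4D2;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Assumption: a small sample table; extend it with whatever names your input uses.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("ETH", "\u00D0");
        NAMED.put("acirc", "\u00E2");
        NAMED.put("ecirc", "\u00EA");
        NAMED.put("oacute", "\u00F3");
    }

    // Replaces every recognised reference in a single left-to-right pass.
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append(resolve(m.group(1), m.group()));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    private static String resolve(String body, String original) {
        if (body.charAt(0) != '#') {
            // Names missing from the table are left exactly as they were.
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.length() > 1
                && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1), 10);
            if (Character.isValidCodePoint(codePoint)) {
                return new String(Character.toChars(codePoint));
            }
        } catch (NumberFormatException e) {
            // Value does not fit in an int: keep the original text.
        }
        return original;
    }

    public static void main(String[] args) {
        System.out.println(decode("&ETH;&#7897;t nhi&ecirc;n"));
    }
}
```

Because the input is scanned once, text produced by a replacement is never decoded again, so "&amp;lt;" becomes "&lt;" rather than "<". For full HTML coverage the library call is still the safer choice.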

Author: ThaiPD

Updated on June 04, 2022

Comments

  • ThaiPD
    ThaiPD almost 2 years

    I have a problem with HTML encoding. I have a string with HTML encoding like the one below:

    &ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.
    

    I want to convert this string to Unicode. The output (actual value) should be:

    Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.
    

    I tried the solution suggested here, but it only helps for strings in which every character is written in the format beginning with &#. For characters beginning with &xxxx, this page told me the encoding is HTML encoding, but my input string is a combination of named HTML entities and decimal HTML entities.

    Can anyone please give me a suggestion? Ideally the solution would not need any additional Java library.

    Thanks in advance!

    [UPDATE] I solved my problem by using the Apache Commons library:

    String encodeString = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.";
    String unEncodeString = StringEscapeUtils.unescapeHtml4(encodeString);
    System.out.println("OUTPUT : " + unEncodeString);
    

    =====> OUTPUT : Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.

  • ThaiPD
    ThaiPD over 9 years
    Thank you for your answer, but what I mean is: how can I convert the string "&ETH;&#7897;t" to the string "Đột"? I already have the input and want to get the output shown above. Could you please help further?
  • ThaiPD
    ThaiPD over 9 years
    Is there any way to do it without the Apache library? I want to fix it without an add-on library...