UTF-8 decoding in Java
Solution 1
When dealing with Strings, always remember: byte != char. So in your first example, you have the char c3, not the byte c3, which is a huge difference: the byte would be part of a UTF-8 sequence, but the char already is Unicode. So when you convert that to UTF-8, the Unicode character c3 must become the byte sequence c3 83.
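To make the byte/char distinction concrete, here is a minimal sketch that encodes the char U+00C3 and the string from the question as UTF-8 and prints the resulting bytes. The class and helper names (CharVsByteDemo, hex, utf8Hex) are made up for illustration; only String.getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

final class CharVsByteDemo {
    // Formats bytes as space-separated lowercase hex, e.g. "64 c3 a9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes the string as UTF-8 and returns the bytes as hex.
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // This is the *char* U+00C3, not the byte 0xC3.
        System.out.println(utf8Hex("\u00C3"));              // c3 83

        // The string from the question: the chars U+00C3 and U+00A9 each need two bytes.
        System.out.println(utf8Hex("d\u00C3\u00A9jeuner")); // 64 c3 83 c2 a9 6a 65 75 6e 65 72
    }
}
```

The second line of output is the same 11-byte sequence the question reports for stmt.getBytes("UTF-8").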
So the question is: how did you get the String? There must be a bug in that code which doesn't properly handle UTF-8 encoded byte sequences.
The reason why ISO-8859-1 usually works is that this encoding maps every char with a code point < 256 (i.e. anything between 0 and 255) to the byte with the same value, so UTF-8 encoded byte sequences pass through unmodified.
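As a rough illustration of why the ISO-8859-1 detour works, the sketch below checks that arbitrary bytes survive a decode/encode round trip through ISO-8859-1, and uses that to undo the wrong decoding from the question. The names Latin1RoundTripDemo, roundTrips and repair are hypothetical. Treat repair as a diagnostic aid rather than a fix; the real fix is to decode correctly in the first place.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTripDemo {
    // True if decoding and re-encoding with ISO-8859-1 returns the same bytes.
    static boolean roundTrips(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.ISO_8859_1);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Undoes a wrong ISO-8859-1 decode of bytes that were really UTF-8.
    static String repair(String wronglyDecoded) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Every possible byte value maps to a char of the same value and back.
        byte[] allBytes = new byte[256];
        for (int i = 0; i < 256; i++) {
            allBytes[i] = (byte) i;
        }
        System.out.println(roundTrips(allBytes)); // true

        // The corrupt string from the question turns back into "d\u00E9jeuner".
        String repaired = repair("d\u00C3\u00A9jeuner");
        System.out.println(repaired.equals("d\u00E9jeuner")); // true
    }
}
```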
Your last example is also wrong: the char e9 is é in ISO-8859-1 and Unicode. In UTF-8, it's not valid, since it's a char and not a byte, and since as a byte it would be missing its c3 prefix (é is the byte sequence c3 a9 in UTF-8). That said, it correctly represents the Unicode string you seek.
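If you want to check this yourself, a strict decoder will reject a lone e9 byte. The sketch below uses CharsetDecoder, which reports malformed input by default (unlike new String(bytes, charset), which silently substitutes a replacement character). The class and method names are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Demo {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input instead of replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A lone e9 byte starts a multi-byte sequence that never finishes.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE9}));              // false

        // c3 a9 is the complete UTF-8 encoding of U+00E9.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0xA9})); // true
    }
}
```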
Solution 2
If you start with the Java String where "d\u00C3\u00A9jeuner".equals(stmt), then the data is already corrupt at this stage.
A Java char is not a C char. A char in Java is 16 bits wide and implicitly contains UTF-16 encoded data. Trying to store any other encoded data in a Java char/String type is asking for trouble. Character data in any other encoding should be kept as byte data.
If you are reading the parameter using the servlet API, then it is likely that the HTTP request contains inconsistent or insufficient encoding information. Check the calling code and the HTTP headers. It is likely that the client is encoding the data as UTF-8, but the servlet is decoding it as ISO-8859-1.
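The mismatch can be reproduced without a servlet container. The sketch below is only a simulation: it stands in for the client with URLEncoder and for the container's parameter parsing with URLDecoder, and the names ParameterDecodingDemo and sendAndReceive are hypothetical. It assumes, as described above, that the client percent-encodes UTF-8 bytes while the server decodes them as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class ParameterDecodingDemo {
    // Simulates a form parameter: the client percent-encodes it in one
    // charset and the server decodes it in (possibly) another.
    static String sendAndReceive(String value, String clientCharset, String serverCharset) {
        try {
            String onTheWire = URLEncoder.encode(value, clientCharset);
            return URLDecoder.decode(onTheWire, serverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset name", e);
        }
    }

    public static void main(String[] args) {
        String original = "d\u00E9jeuner";

        // Mismatch: yields the corrupt string from the question.
        String corrupt = sendAndReceive(original, "UTF-8", "ISO-8859-1");
        System.out.println(corrupt.equals("d\u00C3\u00A9jeuner")); // true

        // Both sides agree on UTF-8: the value arrives intact.
        String intact = sendAndReceive(original, "UTF-8", "UTF-8");
        System.out.println(intact.equals(original));               // true
    }
}
```

In a real servlet, which charset the container picks depends on the request headers and the container configuration, so this only shows the effect of the mismatch, not where it is configured.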
user162346
Updated on February 17, 2020
Comments
-
user162346 about 4 years
I'm trying to pass parameters from a PHP middle tier to a Java backend that understands J2EE. I'm writing the controller code in Groovy. In there, I'm trying to decode a parameter that will likely contain international characters.
I am really puzzled by the results of my debugging this problem so far, hence I wanted to share it with you in the hope that someone will be able to give the correct interpretation of my results.
For the sake of my little test, the parameter I'm passing is "déjeuner". Just to be sure, System.out.println("déjeuner") correctly gives me:
déjeuner
in the console.
Now, the following are the char, decimal, and hex values of each char of the original string:
next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72
Note that the c3a9 sequence in UTF-8 is the wished-for character: http://www.fileformat.info/info/unicode/char/00e9/index.htm
Now if I try to read this string as a UTF-8 string, as in stmt.getBytes("UTF-8"), I suddenly end up with an 11-byte sequence, as follows:
64 c3 83 c2 a9 6a 65 75 6e 65 72
whereas stmt.getBytes("iso-8859-1") gives me 9 bytes:
64 c3 a9 6a 65 75 6e 65 72
Note the c3a9 sequence here!
Now if I try to convert the UTF-8 sequence to UTF-8, as in
new String(stmt.getBytes("UTF-8"), "UTF-8");
I get:
next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72
note the c3a9 sequence
while
new String(stmt.getBytes("iso-8859-1"), "UTF-8")
results in:
next char: d 100 64
next char: ? -23 e9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72
Note the e9, which in UTF-8 (and ASCII) is, again, the 'é' character that I'm longing for.
Unfortunately, in neither case am I ending up with a proper string that would display like the literal string "déjeuner". Strangely enough, the byte sequences both seem correct though.
-
user162346 over 14 years
Thanks for the very informative answer. So it boils down to request.getParameter() in javax.servlet.http.HttpServletRequest not correctly handling UTF-8 encoded byte sequences, right? I have called req.setCharacterEncoding("UTF-8") on it, though. What possible workaround am I left with? It still isn't clear to me how I get the original data for my parameters (the bytes, not the chars) so I can get some non-buggy String implementation to work out the right UTF string from it...
-
Aaron Digulla over 14 years
My guess is that the sender encodes the data with UTF-8 but fails to set the correct HTTP headers for this.
-
Aaron Digulla over 14 years
So make sure that the PHP part generates web pages that correctly specify their encoding, especially in forms.
-
Aaron Digulla over 14 years
After that, the Java code should decode the data correctly without any manual corrections by you.
-
user162346 over 14 years
Yes, you are totally right. The culprit was the PHP cURL code, which only worked for me in POST mode. Also, on the return path (getting the string back from the database and to PHP through Groovy), I had some more problems that I solved by following the instructions given here: mathiasrichter.blogspot.com/2009/10/…
-
BalusC about 14 years
Hi, welcome to Stack Overflow! Please do not post your own questions as answers to others' questions! They will get lost in the noise and nobody will respond to your question. Just post a question by clicking the Ask Question button at the top right. Once you've done that, please delete this noise from this topic as well.