Convert ANSI characters to UTF-8 in Java

Solution 1

This error is not caused by character encoding. It means the length of the UTF data is wrong.

EDIT: Just realized this is a writing error, not a reading error.

writeUTF() stores the encoded length in a 2-byte header, so a single call can write at most 65,535 bytes of encoded data. You are trying to write about 100K, so it is not going to work.

This limit is hardcoded in DataOutputStream and there is no way around it:

if (utflen > 65535)
    throw new UTFDataFormatException(
            "encoded string too long: " + utflen + " bytes");

Solution 2

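import java.nio.charset.StandardCharsets;

byte[] asciiBytes = ...;
// The Charset constants avoid the checked UnsupportedEncodingException thrown by the String-name overloads.
String unicode = new String(asciiBytes, StandardCharsets.US_ASCII);
byte[] utfBytes = unicode.getBytes(StandardCharsets.UTF_8);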

Solution 3

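Which ANSI codepage? There are lots of different character encodings that are all referred to as "ANSI". The original DOS codepage is 437; codepage 850 is the Western European variant, which trades some of the box-drawing symbols for accented letters. If your data uses codepage 850, this will work: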

String unicode = new String(bytes, "IBM850");

(where bytes is an array holding the ANSI-encoded bytes). After that, you can convert this string into a byte array in any encoding using unicode.getBytes(encoding).

Windows often uses the codepage 1252 (use "windows-1252" for that).
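
To make the two steps concrete (decode with the source codepage, then encode as UTF-8), here is a minimal sketch. The class name AnsiToUtf8 and the helper toUtf8 are made up for this illustration, and it assumes your JDK ships the "windows-1252" and "IBM850" charsets, which is not guaranteed on every runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of any library.
class AnsiToUtf8 {

    // Decode the bytes with the named source charset, then re-encode as UTF-8.
    // Charset.forName throws UnsupportedCharsetException if the JDK lacks the charset.
    static byte[] toUtf8(byte[] input, String sourceCharsetName) {
        Charset source = Charset.forName(sourceCharsetName);
        String unicode = new String(input, source);
        return unicode.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The letter e-acute is a different single byte in each codepage.
        byte[] cp1252 = {(byte) 0xE9}; // e-acute in windows-1252
        byte[] cp850 = {(byte) 0x82};  // e-acute in IBM850

        byte[] fromCp1252 = toUtf8(cp1252, "windows-1252");
        byte[] fromCp850 = toUtf8(cp850, "IBM850");

        // Both inputs should end up as the same UTF-8 byte sequence.
        System.out.println(Arrays.equals(fromCp1252, fromCp850));
    }
}
```

If you decode with the wrong codepage, no exception is thrown; you simply get the wrong characters, so the charset name has to match what the sender actually used.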

Solution 4

ZZ Coder already answered the question, but I have written a more detailed explanation and suggested a workaround on this blog. Basically, the problem is in DataOutputStream, because it restricts the writable String to 64KB. There are other possible workarounds to sidestep the issue; some might work without breaking the actual binary data format one is using...
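
One such workaround, along the lines ZZ Coder suggests in the comments (write the string yourself with a 4-byte length header), could look like the sketch below. The class LongStringIO and its two methods are hypothetical names for this example. Note the assumptions: it writes standard UTF-8 rather than the modified UTF-8 used by writeUTF, so the result cannot be read back with readUTF, and both the writer and the reader have to be changed together.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

// Hypothetical replacement for writeUTF/readUTF; not wire-compatible with them.
class LongStringIO {

    // Write a 4-byte length header followed by the UTF-8 bytes of the string.
    static void writeLongString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Read the 4-byte length header, then exactly that many bytes.
    static String readLongString(DataInput in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative string length: " + length);
        }
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Same test string as in the question: 120000 'a' characters.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 120000; i++) {
            sb.append('a');
        }
        String longString = sb.toString();

        // writeUTF rejects it because the encoded form exceeds 65535 bytes.
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(longString);
        } catch (UTFDataFormatException e) {
            System.out.println("writeUTF failed: " + e.getMessage());
        }

        // The length-prefixed variant round-trips the same string.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongString(out, longString);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println(readLongString(in).equals(longString));
        }
    }
}
```

With an int header the limit moves from 64KB to roughly 2GB of encoded data, which in practice is bounded by available memory first.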

Author by

n002213f

Updated on July 09, 2022

Comments

  • n002213f
    n002213f almost 2 years

    Is there a way to convert an ANSI string to UTF using Java?

    I have a custom serializer that uses the readUTF and writeUTF methods of DataInputStream and DataOutputStream to deserialize and serialize strings. If I receive a string that is encoded in ANSI and is too long (~100000 chars), I get the error:

    Caused by: java.io.UTFDataFormatException: encoded string too long: 106958 bytes

    However, in my JUnit tests I'm able to create a string with 120000 'a's and it works perfectly.

    I have checked the following posts but am still getting errors:

  • n002213f
    n002213f over 14 years
    Interesting, but why do all my tests with more characters pass?
  • ZZ Coder
    ZZ Coder over 14 years
    You have to show me your test cases. They are wrong. See my edits.
  • n002213f
    n002213f over 14 years
    i used the following code to generate the test string; StringBuffer sb2 = new StringBuffer(); for (int i=0; i < 120000;i++) { sb2.append("a"); } String longString2 = sb2.toString();
  • n002213f
    n002213f over 14 years
    tried it but does not work, i get the same error. Is there a way to check the encoding in a string so that i can be sure its ANSI?
  • ZZ Coder
    ZZ Coder over 14 years
    You can create long strings, until memory is out. You just can't write long strings using writeUTF(). Write it your own way with a 4 byte length header.
  • iammichael
    iammichael over 14 years
    It seems I misread the original question regarding ASCII vs. ANSI, and with the latest question edits, my answer is not really relevant.
  • Thufir
    Thufir over 10 years
    this will convert ANSI from telnet, like a mud game, to a "regular" String?
  • Aaron Digulla
    Aaron Digulla over 10 years
    This will convert bytes from any source to a Unicode string. But for it to work properly, you need to know exactly which encoding the source is using. It doesn't matter if that's a file, a remote service or a hardware device.
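
On the question of checking the encoding: a Java String has no encoding of its own; only the bytes it was decoded from do. What you can do is test whether a byte array decodes cleanly in a candidate charset. The sketch below is one way to do that with a strict CharsetDecoder; EncodingCheck and isValid are hypothetical names for this example. It can only rule encodings out: single-byte charsets such as ISO-8859-1 accept every possible byte, so a "valid" result does not prove that the guess is right.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
class EncodingCheck {

    // Returns true if the bytes decode in the given charset without malformed
    // or unmappable input. REPORT makes the decoder throw instead of silently
    // substituting the replacement character.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] singleByte = {(byte) 0xE9};             // e-acute in windows-1252
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};      // e-acute in UTF-8

        System.out.println(isValid(singleByte, StandardCharsets.UTF_8));      // lone lead byte
        System.out.println(isValid(utf8, StandardCharsets.UTF_8));            // well-formed pair
        System.out.println(isValid(singleByte, StandardCharsets.US_ASCII));   // byte above 0x7F
        System.out.println(isValid(singleByte, StandardCharsets.ISO_8859_1)); // accepts any byte
    }
}
```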