How to count String bytes properly?


The word endereço should return me length 9 instead of 8.

If you expect a size of 9 bytes for the String "endereço", which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I suppose that you want to use the UTF-8 charset, which uses 1 byte for characters included in the ASCII table and more for the others.

but String length method or getting the length of it with the byte array returned from getBytes method doesn't return special chars counted as two bytes.


The String length() method doesn't answer the question "how many bytes are used?" It answers the question "how many UTF-16 code units (or, more simply, chars) does the String contain?"

String length() Javadoc :

Returns the length of this string. The length is equal to the number of Unicode code units in the string.
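To make the distinction concrete, here is a minimal sketch that compares the three counts for the same String. The class name LengthVersusBytes and its helper methods are only illustrative names, and the strings are written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class LengthVersusBytes {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes once the String is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00E7o";   // the word from the question, with c-cedilla
        String emoji = "\uD83D\uDE00";   // U+1F600: one code point stored as two chars

        // prints "8 8 9"
        System.out.println(codeUnits(word) + " " + codePoints(word) + " " + utf8Bytes(word));
        // prints "2 1 4"
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji) + " " + utf8Bytes(emoji));
    }
}
```

None of the three numbers is "the" length: which one you need depends on whether you are indexing chars, counting characters, or sizing an encoded buffer.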


The byte[] getBytes() method with no argument encodes the String into a byte array. You can use the length of the returned array to know how many bytes the encoded String uses, but the result depends on the charset used for the encoding. The no-argument getBytes() method doesn't let you specify that charset: it uses the platform's default charset.
So it may not give the expected result if the default charset is not the one you want to use to encode your Strings into bytes.
Besides, the way Strings are encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution, or not at all.

byte[] getBytes() Javadoc :

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

In your example "endereço", if getBytes() returns an array with a size of 8 and not 9, it means that your platform's default charset is not UTF-8 but a fixed-width charset using 1 byte per character, such as ISO 8859-1 or a related charset such as windows-1252 on Windows.

To find out the default charset of the Java virtual machine running the application, you can use this utility method: Charset defaultCharset = Charset.defaultCharset();


Solution

The byte[] getBytes() method comes with two other very useful overloads:

  • byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException

  • byte[] java.lang.String.getBytes(Charset charset)

Unlike the getBytes() method with no argument, these methods let you specify the charset to use for the byte encoding.

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException Javadoc :

Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

byte[] java.lang.String.getBytes(Charset charset) Javadoc :

Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
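Both Javadoc excerpts point to java.nio.charset.CharsetEncoder when more control is needed. As a rough sketch of what that could look like, the hypothetical helper strictByteLength below (not part of the JDK) asks the encoder to report unencodable characters instead of replacing them, so a wrong charset fails loudly rather than silently producing a misleading size.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncoding {

    // Hypothetical helper: returns the encoded size in bytes, or throws
    // if any character cannot be represented in the given charset.
    static int strictByteLength(String s, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(s));
        return encoded.remaining();
    }

    public static void main(String[] args) throws CharacterCodingException {
        String word = "endere\u00E7o";   // the word from the question, with c-cedilla

        // getBytes(Charset) silently replaces the c-cedilla with '?': prints 8
        System.out.println(word.getBytes(StandardCharsets.US_ASCII).length);

        // UTF-8 can represent every character: prints 9
        System.out.println(strictByteLength(word, StandardCharsets.UTF_8));

        // US-ASCII cannot represent the c-cedilla, so the strict encoder throws
        try {
            strictByteLength(word, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("not encodable in US-ASCII: " + e);
        }
    }
}
```

Whether reporting or replacing is preferable depends on the application; for simply measuring a UTF-8 size, getBytes(StandardCharsets.UTF_8) is usually enough.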

You may use either one (although there are some subtle differences between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size in that specific charset.

For example, to get a UTF-8 encoded byte array by using getBytes(String charsetName), you can do this:

String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;

And you will get a length of 9 bytes as you wish.

Here is a more comprehensive example that displays the default charset and the encoded size with the platform's default charset, UTF-8 and UTF-16:

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringByteCount {

    public static void main(String[] args) throws UnsupportedEncodingException {

        // default charset of the running JVM
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // String sample
        String yourString = "endereço";

        // getBytes() with the platform's default charset
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

        // getBytes() with specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

        // getBytes() with specific charset UTF-16
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}

Output on my Windows machine:

default charset = windows-1252
getBytes() with default charset, size = 8

getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9

getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
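The 18 bytes reported for UTF-16 may look surprising, since 8 chars at 2 bytes each would give 16. The extra 2 bytes are the byte order mark that Java's "UTF-16" charset writes when encoding. The sketch below (the class name Utf16Sizes is just an illustrative name) compares it with the UTF-16BE and UTF-16LE charsets, which fix the byte order and write no mark.

```java
import java.nio.charset.StandardCharsets;

class Utf16Sizes {

    public static void main(String[] args) {
        String word = "endere\u00E7o";   // the word from the question, with c-cedilla

        byte[] withBom = word.getBytes(StandardCharsets.UTF_16);
        byte[] bigEndian = word.getBytes(StandardCharsets.UTF_16BE);
        byte[] littleEndian = word.getBytes(StandardCharsets.UTF_16LE);

        // prints "UTF-16   : 18 bytes, starts with FE FF"
        System.out.printf("UTF-16   : %d bytes, starts with %02X %02X%n",
                withBom.length, withBom[0] & 0xFF, withBom[1] & 0xFF);
        // prints "UTF-16BE : 16 bytes"
        System.out.println("UTF-16BE : " + bigEndian.length + " bytes");
        // prints "UTF-16LE : 16 bytes"
        System.out.println("UTF-16LE : " + littleEndian.length + " bytes");
    }
}
```

So if you need the raw 2-bytes-per-char size, UTF_16BE or UTF_16LE is probably the charset to measure with.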

Author by

Philippe Gioseffi

Father and Java Architect.

Updated on June 12, 2022

Comments

  • Philippe Gioseffi
    Philippe Gioseffi almost 2 years

    A String containing special chars such as ç takes two bytes for each special char, but neither the String length method nor the length of the byte array returned by the getBytes method counts special chars as two bytes.

    How can I count correctly the number of bytes in a String?

    Example:

    The word endereço should return me length 9 instead of 8.