Encode byte[] as String


Solution 1

You should absolutely use base64 or possibly hex. (Either will work; base64 is more compact but harder for humans to read.)

You claim "both variants work perfectly" but that's actually not true. If you use the first approach and data is not actually a valid UTF-8 sequence, you will lose data. You're not trying to convert UTF-8-encoded text into a String, so don't write code which tries to do so.

Using ISO-8859-1 as an encoding will preserve all the data - but in very many cases the string that is returned will not be easily transported across other protocols. It may very well contain unprintable control characters, for example.

Only use the String(byte[], String) constructor when you've got inherently textual data, which you happen to have in an encoded form (where the encoding is specified as the second argument). For anything else - music, video, images, encrypted or compressed data, just for example - you should use an approach which treats the incoming data as "arbitrary binary data" and finds a textual encoding of it... which is precisely what base64 and hex do.
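For example, here is a minimal sketch using the JDK's java.util.Base64 (available since Java 8); the class name and sample byte values are made up for illustration:

import java.util.Arrays;
import java.util.Base64;

public class Base64Sketch {
    public static void main(String[] args) {
        // Arbitrary binary data, including bytes that are not valid UTF-8
        byte[] data = { 0x00, (byte) 0xC0, (byte) 0x80, (byte) 0xFF };

        // Encode to a printable ASCII string and decode back without losing any bytes
        String encoded = Base64.getEncoder().encodeToString(data);
        byte[] decoded = Base64.getDecoder().decode(encoded);

        System.out.println(encoded);                       // AMCA/w==
        System.out.println(Arrays.equals(data, decoded));  // true
    }
}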

Solution 2

You can store bytes in a String, though it's not a good idea. You can't use UTF-8, as this will mangle the bytes; a faster and more efficient option is to use ISO-8859-1 encoding or plain 8-bit. The simplest way to do this is to use

String s1 = new String(data, 0); // deprecated String(byte[] ascii, int hibyte) constructor; hibyte 0 keeps each byte as the low 8 bits of the char

or

String s1 = new String(data, "ISO-8859-1");
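As a quick sanity check, here is a sketch of the round trip with the second option (the class name and sample bytes are just for illustration):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Iso88591RoundTrip {
    public static void main(String[] args) {
        byte[] data = { 0x00, 0x7F, (byte) 0x80, (byte) 0xFF };

        // ISO-8859-1 maps each byte value 0x00-0xFF to the code point of the same value,
        // so encoding and decoding is lossless
        String s = new String(data, StandardCharsets.ISO_8859_1);
        byte[] roundTripped = s.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(data, roundTripped)); // true
    }
}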

From UTF-8 on Wikipedia, and as Jon Skeet notes, these overlong encodings are not valid under the standard. Their behaviour in Java varies: DataInputStream treats the first three as the same, while the next two throw an exception; the Charset decoder silently treats them as separate characters.

00000000 is \0
11000000 10000000 is \0
11100000 10000000 10000000 is \0
11110000 10000000 10000000 10000000 is \0
11111000 10000000 10000000 10000000 10000000 is \0
11111100 10000000 10000000 10000000 10000000 10000000 is \0

This means if you see \0 in your String, you have no way of knowing for sure what the original byte[] values were. DataOutputStream uses the second option for compatibility with C, which sees \0 as a terminator.
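To make the loss concrete, here is a small sketch (class name is illustrative). Note that the String(byte[], Charset) decoder substitutes U+FFFD rather than \0 for the overlong form, as Jon Skeet points out in the comments, but either way the original byte values are unrecoverable:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OverlongSketch {
    public static void main(String[] args) {
        byte[] overlongNul = { (byte) 0xC0, (byte) 0x80 }; // overlong two-byte encoding of \0

        // The UTF-8 Charset decoder replaces the invalid bytes with U+FFFD,
        // so re-encoding does not give back the original byte[]
        String s = new String(overlongNul, StandardCharsets.UTF_8);
        byte[] back = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(Integer.toHexString(s.charAt(0)));  // fffd
        System.out.println(Arrays.equals(overlongNul, back));  // false
    }
}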

BTW DataOutputStream is not aware of code points, so it writes characters above U+FFFF as UTF-16 surrogate pairs, each surrogate then being UTF-8 encoded on its own.
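For instance, a quick sketch of that behaviour with DataOutputStream.writeUTF (class name is illustrative): a single code point above U+FFFF comes out as two 3-byte sequences, one per UTF-16 surrogate, rather than one 4-byte UTF-8 sequence:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfSketch {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF("\uD83D\uDE00"); // U+1F600, one code point, two chars
        }
        // 2-byte length prefix + 3 bytes per surrogate = 8 bytes total;
        // standard UTF-8 would need only 4 bytes for the code point itself
        System.out.println(buffer.toByteArray().length); // 8
    }
}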

0xFE and 0xFF are not valid to appear anywhere in an encoded character. Byte values of 0b11000000 and above can only appear at the start of a character, not inside a multi-byte character.

Solution 3

I confirmed the accepted answer with Java. To repeat: UTF-8 and UTF-16 do not preserve all the byte values; ISO-8859-1 does. But if the encoded string is to be transported beyond the JVM, use Base64.

// Class wrapper and imports added so this compiles as-is; it assumes JUnit 4
// and Apache Commons Codec (for Base64) on the classpath.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.commons.codec.binary.Base64;
import org.junit.Test;

public class ByteEncodingTest {

@Test
public void testBase64() {
    final byte[] original = enumerate();
    final String encoded = Base64.encodeBase64String( original );
    final byte[] decoded = Base64.decodeBase64( encoded );
    assertTrue( "Base64 preserves bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testIso8859() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.ISO_8859_1 );
    final byte[] decoded = s.getBytes( StandardCharsets.ISO_8859_1 );
    assertTrue( "ISO-8859-1 preserves bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testUtf16() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.UTF_16 );
    final byte[] decoded = s.getBytes( StandardCharsets.UTF_16 );
    assertFalse( "UTF-16 does not preserve bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testUtf8() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.UTF_8 );
    final byte[] decoded = s.getBytes( StandardCharsets.UTF_8 );
    assertFalse( "UTF-8 does not preserve bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testEnumerate() {
    final Set<Byte> byteSet = new HashSet<>();
    final byte[] bytes = enumerate();
    for ( byte b : bytes ) {
        byteSet.add( b );
    }
    assertEquals( "Expecting 256 distinct values of byte.", 256, byteSet.size() );
}

/**
 * Enumerates all the byte values.
 */
private byte[] enumerate() {
    final int length = Byte.MAX_VALUE - Byte.MIN_VALUE + 1;
    final byte[] bytes = new byte[length];
    for ( int i = 0; i < length; i++ ) {
        bytes[i] = (byte)(i + Byte.MIN_VALUE);
    }
    return bytes;
}
}
Author: maxammann

Updated on July 09, 2022

Comments

  • maxammann, almost 2 years ago

    Heyho,

    I want to convert byte data, which can be anything, to a String. My question is whether it is "secure" to encode the byte data with UTF-8, for example:

    String s1 = new String(data, "UTF-8");
    

    or by using base64:

    String s2 = Base64.encodeToString(data, false); //migbase64
    

    I'm just afraid that using the first method has negative side effects. I mean both variants work p̶e̶r̶f̶e̶c̶t̶l̶y̶, but s1 can contain any character of the UTF-8 charset, while s2 only uses "readable" characters. I'm just not sure if it's really needed to use base64. Basically I just need to create a String, send it over the network and receive it again. (There is no other way in my situation :/)

    The question is only about negative side effects, not if it's possible!

  • Vishy, over 10 years ago
    Even if the byte[] is valid you can still use data. This is because there is a one unique encoding for each character. e.g. Java could use 1 bytes for \0 but it chooses to use 2.
  • Jon Skeet, over 10 years ago
    @PeterLawrey: I don't understand your first sentence at all, or how it relates to the second...
  • maxammann, over 10 years ago
    kk very good answer, the only thing I don't get is how I can lose data. Does Java clear bytes if they are not valid UTF-8?
  • Vishy, over 10 years ago
    Let me try again, lost Internet so couldn't edit; Even if the byte[] is a valid UTF-8 encoding, you could lose data as there are multiple valid encodings for a given character, making reliable transformations back into the original byte[] impossible.
  • Vishy, over 10 years ago
    @p000ison UTF-8 doesn't use every possible byte value in every combination, which means some combinations are not valid. Some combinations produce the same char as others, meaning there is no way to be sure what the original byte[] was.
  • Jon Skeet, over 10 years ago
    @PeterLawrey: I didn't think UTF-8 allowed multiple valid encodings for a single character. From wikipedia: "The standard specifies that the correct encoding of a code point use only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point."
  • maxammann, over 10 years ago
    k thanks, now everything is clear, I wish I could accept both answers :D
  • Vishy, over 10 years ago
    Java's UTF-8 decoder sees 0b11000000, 0b100000000 as two characters.
  • Jon Skeet, over 10 years ago
    @PeterLawrey: Did you mean the second to be 0b10000000? As far as I can tell that's an overlong encoding. Java is decoding it as U+FFFD U+FFFD, where U+FFFD is the replacement character - effectively rejecting it, which is correct. I don't count that as falling into your description of the byte[] being "a valid UTF-8 encoding".
  • WestCoastProjects, about 7 years ago
    What is the difference between that 0 - which I am unfamiliar with - and the standard approach of ISO-8859-1? Is the former a shorthand for the latter?
  • Vishy, about 7 years ago
    @javadba ISO-8859-1 will encode unsupported characters as ?, whereas if you just take the lower 8 bits you are likely to get a somewhat random character.