UTF-8 byte[] to String

Solution 1

Look at the constructor for String:

String str = new String(bytes, StandardCharsets.UTF_8);

And if you're feeling lazy, you can use the Apache Commons IO library to convert the InputStream to a String directly:

String str = IOUtils.toString(inputStream, StandardCharsets.UTF_8);
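As a minimal, self-contained sketch of the constructor approach (the class name Utf8Decode and the sample text are made up for illustration), the round trip below encodes a string containing a two-byte character and decodes it again:

```java
import java.nio.charset.StandardCharsets;

class Utf8Decode {
    // Decode a UTF-8 byte array. The Charset overload throws no checked exception.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00e9" (e with acute accent) takes two bytes in UTF-8.
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);           // 6 bytes for 5 characters
        System.out.println(decode(bytes).length()); // 5
    }
}
```

The Charset overload is available from Java 6 and StandardCharsets from Java 7, so this sketch assumes Java 7 or later.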

Solution 2

Java's String class has a built-in constructor for converting a byte array to a string.

byte[] byteArray = new byte[] {87, 79, 87, 46, 46, 46};

String value = new String(byteArray, "UTF-8");

Solution 3

To convert UTF-8 data, you can't assume a one-to-one correspondence between bytes and characters. Try this:

String file_string = new String(bytes, "UTF-8");
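To make the byte/character mismatch concrete, here is a small sketch (the class and method names are hypothetical) that compares a per-byte cast with proper UTF-8 decoding for a single two-byte character:

```java
import java.nio.charset.StandardCharsets;

class ByteCastPitfall {
    // Per-byte cast: treats every byte as one character, which is wrong for UTF-8.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is encoded in UTF-8 as two bytes: 0xC3 0xA9.
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        String cast = castEachByte(bytes);
        System.out.println(decoded.length());     // 1
        System.out.println(cast.length());        // 2
        System.out.println(decoded.equals(cast)); // false
    }
}
```

For pure ASCII input the two approaches happen to agree, which is why the cast can appear to work until the first non-ASCII character shows up.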

(Bah. I see I'm way too slow in hitting the Post Your Answer button.)

To read an entire file as a String, do something like this:

public String openFileToString(String fileName) throws IOException
{
    InputStream is = new BufferedInputStream(new FileInputStream(fileName));

    try {
        InputStreamReader rdr = new InputStreamReader(is, "UTF-8");
        StringBuilder contents = new StringBuilder();
        char[] buff = new char[4096];
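        int len = rdr.read(buff);
        while (len >= 0) {
            contents.append(buff, 0, len);
            // read the next chunk; without this the loop never terminates
            len = rdr.read(buff);
        }
        // return the accumulated text, not the char buffer
        return contents.toString();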
    } finally {
        try {
            is.close();
        } catch (Exception e) {
            // log error in closing the file
        }
    }
}
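On Java 7 or later, the same idea can be sketched more compactly with try-with-resources and java.nio.file. This is an alternative under that assumption, not the answer's original code, and the class name ReadFileUtf8 is made up. One behavioral difference worth knowing: a reader from Files.newBufferedReader throws an IOException on malformed input, whereas an InputStreamReader created with a charset name substitutes replacement characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadFileUtf8 {
    // Streams the file through a Reader, so multi-byte sequences that
    // straddle a buffer boundary are still decoded correctly.
    static String readFile(String fileName) throws IOException {
        Path path = Paths.get(fileName);
        StringBuilder contents = new StringBuilder();
        // try-with-resources closes the reader even if read() throws
        try (Reader rdr = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            char[] buff = new char[4096];
            int len;
            while ((len = rdr.read(buff)) >= 0) {
                contents.append(buff, 0, len);
            }
        }
        return contents.toString();
    }
}
```

Reading in chunks keeps the original line delimiters intact, unlike a readLine() loop.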

Solution 4

You can use the String(byte[] bytes) constructor for that. EDIT: You also have to consider your platform's default charset, as per the Javadoc:

Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array. The behavior of this constructor when the given bytes are not valid in the default charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
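The CharsetDecoder route mentioned in that quote can be sketched as follows. This is an illustrative example, not part of the original answer, and the class and method names are hypothetical. It contrasts the lenient String constructor that takes a Charset, which substitutes U+FFFD for malformed input, with a decoder configured to report errors:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Throws CharacterCodingException on invalid UTF-8 instead of
    // silently substituting the replacement character U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // 0xC3 starts a two-byte sequence, but 0x28 is not a valid continuation byte.
        byte[] invalid = {(byte) 0xC3, (byte) 0x28};
        // Lenient: the constructor replaces the malformed sequence.
        System.out.println(new String(invalid, StandardCharsets.UTF_8).contains("\uFFFD")); // true
        try {
            decodeStrict(invalid);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

Strict decoding is useful when corrupt input should be detected rather than passed along with replacement characters.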

Solution 5

Knowing that you are dealing with a UTF-8 byte array, you'll definitely want to use the String constructor that accepts a charset name. Otherwise you may leave yourself open to some charset encoding based security vulnerabilities. Note that it throws UnsupportedEncodingException which you'll have to handle. Something like this:

public String openFileToString(byte[] _bytes) {
    String file_string;
    try {
        file_string = new String(_bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // this should never happen because "UTF-8" is hard-coded.
        throw new IllegalStateException(e);
    }
    return file_string;
}

Author: skeryl

loquacious/sarcastic

Updated on April 18, 2020

Comments

  • skeryl
    skeryl about 4 years

    Let's suppose I have just used a BufferedInputStream to read the bytes of a UTF-8 encoded text file into a byte array. I know that I can use the following routine to convert the bytes to a string, but is there a more efficient/smarter way of doing this than just iterating through the bytes and converting each one?

    public String openFileToString(byte[] _bytes)
    {
        String file_string = "";
    
        for(int i = 0; i < _bytes.length; i++)
        {
            file_string += (char)_bytes[i];
        }
    
        return file_string;    
    }
    
    • CoolBeans
      CoolBeans over 12 years
      Why can't you just do this String fileString = new String(_bytes,"UTF-8"); ?
    • Andy Thomas
      Andy Thomas over 12 years
      Alternatively, you could use BufferedReader to read into a char array.
    • skeryl
      skeryl over 12 years
      @CoolBeans I could if I had known to do that ;) Thank you.
    • Bruno
      Bruno over 12 years
      Depending on the file size, I'm not sure loading the whole byte[] in memory and converting it via new String(_bytes,"UTF-8") (or even by chunks with += on the string) is the most efficient. Chaining InputStreams and Readers might work better, especially on large files.
    • CoolBeans
      CoolBeans over 12 years
      @Bruno - That's a valid observation. I guess he will find out if he starts getting out of memory exceptions :)
    • Raedwald
      Raedwald about 9 years
      Your provided code does not decode UTF-8. It does not handle any of the code points that require multiple bytes.
  • Mike Daniels
    Mike Daniels over 12 years
    And if your bytes are not in the platform's default charset, you can use the version that has the second Charset argument to make sure the conversion is correct.
  • zengr
    zengr over 12 years
    my dear lord. String str = new String(byte[]) will do just fine.
  • Ted Hopp
    Ted Hopp over 12 years
    This improves the efficiency, but it doesn't decode utf8 data properly.
  • GETah
    GETah over 12 years
    @MikeDaniels Indeed, I did not want to include all the details. Just edited my answer
  • Bruno
    Bruno over 12 years
    Sometimes, it's useful to keep the original line delimiters. The OP might want that.
  • siledh
    siledh over 10 years
    Or Guava's Charsets.UTF_8 if you are on JDK older than 1.7
  • scottt
    scottt over 10 years
    Code edited to make the default be utf-8 to match the OP's question.
  • Ben Clayton
    Ben Clayton over 9 years
    Use Guava's Charsets.UTF_8 if you are on Android API below 19 too
  • Attila Neparáczki
    Attila Neparáczki over 9 years
    And if checkstyle says: "Illegal Instantiation: Instantiation of java.lang.String should be avoided.", then what?
  • nyxz
    nyxz about 9 years
    You can see in here the java.nio.charset.Charset.availableCharsets() map all the charsets not just the charsets in the StandardCharsets. And if you want to use some other charset and still want to prevent the String constructor from throwing UnsupportedEncodingException you may use java.nio.charset.Charset.forName()
  • Aung Myat Hein
    Aung Myat Hein almost 8 years
    IOUtils.toString(inputStream, StandardCharsets.UTF_8) is deprecated now.
  • greg-449
    greg-449 about 7 years
    This doesn't specify a character set so you get the platform default character set which may well not be UTF-8.
  • Dayan
    Dayan almost 6 years
    This will crash on large data with out-of-memory on a TC75 Zebra device.
  • Admin
    Admin almost 6 years
    Isn't using the String constructor discouraged as it may result in having two different string objects containing the same character data?
  • programmerRaj
    programmerRaj over 2 years
    It worked for me without specifying StandardCharsets.UTF_8. I just did new String(bytes)
  • http8086
    http8086 about 2 years
    The Java source doc says "Constructs a new {@code String} by decoding the specified array of bytes". I'm wondering: if there are bytes that cannot be decoded in the specified Charset, let's say UTF-8, what will happen?