How do I truncate a Java string to fit in a given number of bytes, once UTF-8 encoded?

Solution 1

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        } else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

This handles surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.
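To make the byte counts concrete, here is a small sketch that measures what Java's UTF-8 encoder produces for a surrogate pair. The class name SurrogatePairBytesDemo and the helper utf8Length are illustrative names only, not part of the answer above.

```java
import java.nio.charset.StandardCharsets;

final class SurrogatePairBytesDemo {
    // Number of bytes Java's UTF-8 encoder produces for s (hypothetical helper).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef): one code point, stored as two Java chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                         // 2 chars
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
        System.out.println(utf8Length(clef));                      // 4 bytes, not 6
    }
}
```

Counting each half of the pair as a separate 3-byte character would estimate 6 bytes, which is why a truncation routine that ignores pairs can cut earlier than necessary.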

I haven't done a lot of testing on truncateWhenUTF8(), but here are some preliminary tests:

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Update: modified the code example; it now handles surrogate pairs.

Solution 2

You should use CharsetEncoder; the simple approach of calling getBytes() and copying as many bytes as fit can cut UTF-8 characters in half.

Something like this:

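import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Encodes as much of input as fits into output and returns the number of bytes written.
public static int truncateUtf8(String input, byte[] output) {
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
    // Stops once the next character no longer fits, so no partial character is written
    utf8Enc.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}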
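If you want the truncated String back rather than a byte count, the same encoder idea can be adapted as in the following sketch. The class name EncoderTruncate and the method truncate are made up for illustration, and the sketch assumes maxBytes is non-negative and the input contains no unpaired surrogates (the encoder stops at one).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class EncoderTruncate {
    // Returns the longest prefix of input whose UTF-8 encoding fits in maxBytes.
    // Assumes maxBytes >= 0 and well-formed input (no unpaired surrogates).
    static String truncate(String input, int maxBytes) {
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(input);
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        // The encoder stops before a character that does not fit in the output
        // buffer, leaving the input position at the first unconsumed char.
        enc.encode(in, out, true);
        return input.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncate("abcd", 2));             // ab
        System.out.println(truncate("a\u0800b", 3).length()); // 1
    }
}
```

The input buffer's position is what makes this work: it tells you how many chars were consumed, so no separate byte-to-char bookkeeping is needed.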

Solution 3

Here's what I came up with. It uses standard Java APIs, so it should be safe and compatible with Unicode quirks such as surrogate pairs. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the, with added checks for null and for skipping the decode step when the encoded string already fits within maxBytes.

/**
 * Truncates a string to the number of characters that fit in maxBytes bytes, without cutting a multi-byte
 * character in half at the cut-off point. Also handles surrogate pairs, where two chars in the string
 * represent a single character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}

Solution 4

UTF-8 encoding has a neat trait: from the top bits of any byte you can tell where you are within a character.

Check the byte in the stream at the byte limit you want:

  • If its high bit is 0, it's a single-byte char. Just replace it with 0 and you're fine.
  • If its high bit is 1 and so is the next bit, then you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
  • If the high bit is 1 but the next bit is 0, then you're in the middle of a character. Travel back along the buffer until you hit a byte that has two or more 1s in the high bits, and replace that byte with 0.

Example: If your stream is 31 33 31 C2 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C2, which is the start of a multi-byte char.
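The steps above can be sketched in Java as follows. Java strings are not zero-terminated, so instead of writing a 0 byte this sketch returns the cut position and decodes only the bytes before it. The names Utf8Boundary, safeCut, and truncate are illustrative, and the sample bytes in main use C2 A3 (the UTF-8 encoding of U+00A3) as the two-byte sequence.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Boundary {
    // Largest cut position <= maxBytes that does not split a UTF-8 sequence.
    // Assumes utf8 holds well-formed UTF-8.
    static int safeCut(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.length;
        }
        int cut = Math.max(maxBytes, 0);
        // utf8[cut] is the first byte that would be dropped. If it is a
        // continuation byte (bit pattern 10xxxxxx), we are in the middle of a
        // character, so walk back to that character's lead byte.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, safeCut(utf8, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stream = {0x31, 0x33, 0x31, (byte) 0xC2, (byte) 0xA3, 0x32, 0x33};
        System.out.println(safeCut(stream, 4)); // 3: a cut at 4 would split C2 A3
        System.out.println(safeCut(stream, 5)); // 5
    }
}
```

Because a UTF-8 character is at most 4 bytes, the backward walk takes at most 3 steps on well-formed input.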

Solution 5

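You can use new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8"). Note that if maxLen falls in the middle of a multi-byte character, the partial bytes are decoded as the replacement character U+FFFD rather than dropped, and maxLen must not exceed the length of the byte array.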
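A quick way to see how this one-liner behaves when the limit lands inside a multi-byte character is the sketch below. The names NaiveByteCutDemo and naiveCut are illustrative, and the Math.min guard is an addition that the one-liner does not have; maxLen is assumed to be non-negative.

```java
import java.nio.charset.StandardCharsets;

final class NaiveByteCutDemo {
    // Cuts the encoded bytes at maxLen and decodes whatever is left.
    // Math.min guards against maxLen exceeding the array length.
    static String naiveCut(String data, int maxLen) {
        byte[] utf8 = data.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(maxLen, utf8.length);
        return new String(utf8, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" is 1 byte and U+0800 is 3 bytes, so a 2-byte limit splits U+0800.
        String cut = naiveCut("a\u0800b", 2);
        // The dangling lead byte is decoded as the replacement character.
        System.out.println(cut.indexOf('\uFFFD') >= 0); // true
    }
}
```

Since U+FFFD itself encodes to 3 bytes in UTF-8, a string produced this way can end up larger than maxLen bytes when re-encoded, so it is worth checking for the replacement character if the byte limit is strict.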

Author: Johan Lübcke

Updated on July 09, 2022

Comments

  • Johan Lübcke, almost 2 years

    How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?