Split Java String in chunks of 1024 bytes

java string split byte

35,908

Solution 1

Strings and bytes are two completely different things, so wanting to split a String into bytes is as meaningless as wanting to split a painting into verses.

What is it that you actually want to do?

To convert between strings and bytes, you need to specify an encoding that can encode all the characters in the String. Depending on the encoding and the characters, some of them may span more than one byte.

You can either split the String into chunks of 1024 characters and encode those as bytes, but then each chunk may be more than 1024 bytes.

Or you can encode the original string into bytes and then split them into chunks of 1024, but then you have to make sure to append them as bytes before decoding the whole into a String again, or you may get garbled characters at the split points when a character spans more than 1 byte.
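A minimal sketch of the second option, assuming UTF-8 as the encoding. The class name ByteChunks and the methods split and join are made up for illustration. It encodes once, cuts the bytes into pieces of at most 1024, and joins the pieces as bytes before decoding, so a character cut in half at a chunk boundary is restored.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ByteChunks {
    // Encode the text once and cut the bytes into pieces of at most chunkSize bytes.
    // A piece may end in the middle of a multi-byte character.
    static List<byte[]> split(String text, int chunkSize) {
        byte[] all = text.getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < all.length; start += chunkSize) {
            int end = Math.min(all.length, start + chunkSize);
            chunks.add(Arrays.copyOfRange(all, start, end));
        }
        return chunks;
    }

    // Append the pieces as bytes first and decode only once at the end.
    static String join(List<byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One ASCII character followed by two-byte characters, so that
        // the 1024-byte boundary falls inside a character.
        StringBuilder sb = new StringBuilder("a");
        for (int i = 0; i < 1000; i++) {
            sb.append('\u00e4');
        }
        String text = sb.toString();

        List<byte[]> chunks = split(text, 1024);
        System.out.println("chunks: " + chunks.size());
        System.out.println("round trip ok: " + join(chunks).equals(text));
    }
}
```

Decoding each piece separately instead of joining first would turn the split character into replacement characters.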

If you're worried about memory usage when the String can be very long, you should use streams (java.io package) to do the en/decoding and splitting, in order to avoid keeping several copies of the data in memory. Ideally, you should avoid having the original String in one piece at all and instead use streams to read it in small chunks from wherever you get it from.

Solution 2

You have two options: the fast way and the memory-conservative way. But first, you need to know what characters are in the String. Is it ASCII only? Are there umlauts (characters between 128 and 255) or even characters beyond that (s.charAt(i) returns something > 255)? Depending on that, you will need to use a different encoding. If you have binary data, try "iso-8859-1" because it will preserve the data in the String. If you have Unicode, try "utf-8". I'll assume binary data:

String encoding = "iso-8859-1";

The fastest way:

ByteArrayInputStream in = new ByteArrayInputStream (string.getBytes(encoding));

Note that a Java String holds Unicode text as 16-bit chars, so each char takes two bytes (and characters outside the Basic Multilingual Plane take two chars). You will have to specify the encoding; don't rely on the "platform default", which will only cause pain later.

Now you can read it in 1024 chunks using

byte[] buffer = new byte[1024];
int len;
while ((len = in.read(buffer)) > 0) { ... }

This needs about three times as much RAM as the original String.
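The read loop above can be completed into a small runnable sketch. The class HeaderChunker and its chunk method are hypothetical names, and two things are assumptions rather than part of the answer: the header counts toward the 1024-byte limit, and it is prepended to every chunk. Only the first len bytes of the buffer are valid after each read.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class HeaderChunker {
    static final int CHUNK_SIZE = 1024;

    // Assumption: header + payload together must fit into CHUNK_SIZE bytes,
    // and the header is repeated at the start of every chunk.
    static List<byte[]> chunk(String header, String body) {
        byte[] head = header.getBytes(StandardCharsets.ISO_8859_1);
        if (head.length >= CHUNK_SIZE) {
            throw new IllegalArgumentException("header leaves no room for payload");
        }
        ByteArrayInputStream in =
                new ByteArrayInputStream(body.getBytes(StandardCharsets.ISO_8859_1));
        byte[] buffer = new byte[CHUNK_SIZE - head.length];
        List<byte[]> chunks = new ArrayList<>();
        int len;
        while ((len = in.read(buffer, 0, buffer.length)) > 0) {
            // Copy the header, then only the len bytes that were actually read.
            byte[] chunk = new byte[head.length + len];
            System.arraycopy(head, 0, chunk, 0, head.length);
            System.arraycopy(buffer, 0, chunk, head.length, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < 2500; i++) {
            body.append('x');
        }
        for (byte[] chunk : chunk("HDR:", body.toString())) {
            System.out.println(chunk.length + " bytes");
        }
    }
}
```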

A more memory-conservative way is to write a converter which takes a StringReader and an OutputStreamWriter (which wraps a ByteArrayOutputStream). Copy characters from the reader to the writer until the underlying buffer contains one chunk of data.

When it does, copy the data to the real output (prepending the header), copy the additional bytes (which the Unicode->byte conversion may have generated) to a temp buffer, call buffer.reset() and write the temp buffer to buffer.

Code looks like this (untested):

// Needs java.io.StringReader, java.io.ByteArrayOutputStream and java.io.OutputStreamWriter.
StringReader r = new StringReader (string);
ByteArrayOutputStream buffer = new ByteArrayOutputStream (1024*2); // Twice as large as necessary
OutputStreamWriter w = new OutputStreamWriter  (buffer, encoding);

char[] cbuf = new char[100];
byte[] tempBuf;
int len;
while ((len = r.read(cbuf, 0, cbuf.length)) > 0) {
    w.write(cbuf, 0, len);
    w.flush(); // push the encoded bytes into buffer so size() is accurate
    if (buffer.size() >= 1024) {
        tempBuf = buffer.toByteArray();
        ... ready to process one chunk (the first 1024 bytes of tempBuf) ...
        buffer.reset();
        if (tempBuf.length > 1024) {
            // carry the surplus bytes over into the next chunk
            buffer.write(tempBuf, 1024, tempBuf.length - 1024);
        }
    }
}
... check if some data is left in buffer and process that, too ...

This only needs a couple of kilobytes of RAM.
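If each chunk has to be decodable on its own, a different technique than the one above can be used: let a java.nio CharsetEncoder fill a fixed-size ByteBuffer. The encoder stops with an overflow result when the next character no longer fits, so no character is split across chunks. This is a sketch under assumptions: UTF-8 as the encoding, and EncoderChunker and chunk as made-up names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EncoderChunker {
    // Returns chunks of at most chunkSize bytes, each ending on a character boundary.
    static List<byte[]> chunk(String text, int chunkSize) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(chunkSize);
        List<byte[]> chunks = new ArrayList<>();
        while (in.hasRemaining()) {
            // Encodes as many whole characters as fit into out.
            CoderResult result = encoder.encode(in, out, true);
            if (result.isError()) {
                throw new IllegalArgumentException("cannot encode input: " + result);
            }
            if (out.position() == 0) {
                throw new IllegalArgumentException("chunkSize too small for one character");
            }
            chunks.add(Arrays.copyOf(out.array(), out.position()));
            out.clear();
        }
        // The UTF-8 encoder keeps no pending output, so flush() is omitted here.
        return chunks;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("a");
        for (int i = 0; i < 1000; i++) {
            sb.append('\u00e4'); // two bytes in UTF-8
        }
        for (byte[] chunk : chunk(sb.toString(), 1024)) {
            System.out.println(chunk.length + " bytes: "
                    + new String(chunk, StandardCharsets.UTF_8).length() + " chars");
        }
    }
}
```

The trade-off is that chunks can be a few bytes shorter than the limit, because a character that does not fit completely is moved to the next chunk.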

[EDIT] There has been a lengthy discussion about binary data in Strings in the comments. First of all, it's perfectly safe to put binary data into a String as long as you are careful when creating it and storing it somewhere. To create such a String, take a byte[] array and:

String safe = new String (array, "iso-8859-1");

In Java, ISO-8859-1 (a.k.a. ISO-Latin1) is a 1:1 mapping. This means the bytes in the array will not be interpreted in any way. Now you can use substring() and the like on the data, search it with indexOf(), run regexps on it, etc. For example, find the position of a 0-byte:

int pos = safe.indexOf('\u0000');

This is especially useful if you don't know the encoding of the data and want to have a look at it before some codec messes with it.

To write the data somewhere, the reverse operation is:

byte[] data = safe.getBytes("iso-8859-1");

Never use the default methods new String(array) or String.getBytes()! One day, your code is going to be executed on a different platform and it will break.
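The round trip can be checked with a short sketch. It uses the StandardCharsets constants (Java 7 and later) instead of the "iso-8859-1" name, and the class and method names are illustrative only. All 256 byte values survive a trip through ISO-8859-1, while the same trip through UTF-8 does not preserve them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] data = new byte[256];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        return data;
    }

    // Decode the bytes to a String and encode them back with the same charset.
    static boolean roundTrips(byte[] data, Charset charset) {
        String decoded = new String(data, charset);
        byte[] encoded = decoded.getBytes(charset);
        return Arrays.equals(data, encoded);
    }

    public static void main(String[] args) {
        byte[] data = allByteValues();
        System.out.println("ISO-8859-1: " + roundTrips(data, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + roundTrips(data, StandardCharsets.UTF_8));
    }
}
```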

Now the problem of characters > 255 in the String. If you use this method, you won't ever have any such character in your Strings. That said, if there were any for some reason, getBytes() would not throw an exception: it replaces every character that ISO-Latin1 cannot express with '?', so the code would fail silently. If you want it to fail loudly instead, encode with a CharsetEncoder whose unmappable-character action is CodingErrorAction.REPORT.
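How characters above 255 are handled can be checked directly. In this sketch (StrictLatin1 and encode are hypothetical names), String.getBytes substitutes '?' for a character ISO-8859-1 cannot represent, while a CharsetEncoder set to CodingErrorAction.REPORT raises an exception.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictLatin1 {
    // Encodes s as ISO-8859-1 and throws if any character does not fit.
    static byte[] encode(String s) {
        try {
            ByteBuffer encoded = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            byte[] result = new byte[encoded.remaining()];
            encoded.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not representable in ISO-8859-1", e);
        }
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // the euro sign is not part of ISO-8859-1

        // Lenient path: the character is silently replaced.
        byte[] lenient = euro.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("getBytes gives: " + (char) lenient[0]);

        // Strict path: the problem is reported.
        try {
            encode(euro);
        } catch (IllegalArgumentException e) {
            System.out.println("strict encoder: " + e.getMessage());
        }
    }
}
```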

Some might argue that this is not safe enough and you should never mix bytes and String. In this day and age, we don't have that luxury. A lot of data has no explicit encoding information (files, for example, don't have an "encoding" attribute in the same way as they have access permissions or a name). XML is one of the few formats which has explicit encoding information and there are editors like Emacs or jEdit which use comments to specify this vital information. This means that, when processing streams of bytes, you must always know in which encoding they are. As of now, it's not possible to write code which will always work, no matter where the data comes from.

Even with XML, you must read the header of the file as bytes to determine the encoding before you can decode the meat.

The important point is to sit down and figure out which encoding was used to generate the data stream you have to process. If you do that, you're good, if you don't, you're doomed. The confusion originates from the fact that most people are not aware that the same byte can mean different things depending on the encoding or even that there is more than one encoding. Also, it would have helped if Sun hadn't introduced the notion of "platform default encoding."
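To make "the same byte can mean different things" concrete, here is a small sketch that decodes one two-byte sequence with three standard charsets. The class name SameBytes and the chosen bytes are only an illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameBytes {
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00E4 (a with diaeresis).
        byte[] data = { (byte) 0xC3, (byte) 0xA4 };

        Charset[] charsets = {
            StandardCharsets.UTF_8,       // one character: U+00E4
            StandardCharsets.ISO_8859_1,  // two characters: U+00C3 U+00A4
            StandardCharsets.UTF_16BE     // one character: U+C3A4
        };
        for (Charset charset : charsets) {
            String text = decode(data, charset);
            StringBuilder codePoints = new StringBuilder();
            for (int i = 0; i < text.length(); i++) {
                codePoints.append(String.format("U+%04X ", (int) text.charAt(i)));
            }
            System.out.println(charset.name() + ": " + codePoints.toString().trim());
        }
    }
}
```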

Important points for beginners:

There is more than one encoding (charset).
There are more characters than the English language uses. There are even several sets of digits (ASCII, full width, Arabic-Indic, Bengali).
You must know which encoding was used to generate the data which you are processing.
You must know which encoding you should use to write the data you are processing.
You must know the correct way to specify this encoding information so the next program can decode your output (XML header, HTML meta tag, special encoding comment, whatever).

The days of ASCII are over.

Solution 3

I know I am late, but I was looking for a solution myself and ended up with the following, which worked best for me:

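// Needs java.io.ByteArrayInputStream, java.io.IOException and java.nio.charset.StandardCharsets.
private static String chunk_split(String original, int length, String separator) throws IOException {
    // ISO-8859-1 maps every byte to exactly one char, so the cast below is safe.
    // Characters above U+00FF are not supported by this approach (they become '?').
    ByteArrayInputStream bis = new ByteArrayInputStream(original.getBytes(StandardCharsets.ISO_8859_1));
    int n;
    byte[] buffer = new byte[length];
    StringBuilder result = new StringBuilder();
    while ((n = bis.read(buffer)) > 0) {
        // Append only the n bytes that were actually read, not the whole buffer.
        for (int i = 0; i < n; i++) {
            result.append((char) (buffer[i] & 0xFF));
        }
        result.append(separator);
    }
    return result.toString();
}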

Example:

public static void main(String[] args) throws IOException{
       String original = "abcdefghijklmnopqrstuvwxyz";
       System.out.println(chunk_split(original,5,"\n"));
}

Output:

abcde
fghij
klmno
pqrst
uvwxy
z


Author by

user54729

Updated on July 09, 2022

Comments

user54729 almost 2 years

What's an efficient way of splitting a String into chunks of 1024 bytes in java? If there is more than one chunk then the header(fixed size string) needs to be repeated in all subsequent chunks.
- mparaz about 15 years
  
  Just checking if you are aware that in Java, Strings are composed of chars and not bytes. A char may be multiple bytes.
- user54729 about 15 years
  
  Thanks I'm very much aware of that. However you can get the corresponding byte[] of a String using String.getBytes(). This is a common problem when for example you want to send the String content over the network.
- zebeurton about 15 years
  
  Why do you need to repeat the header, exactly?
user54729 about 15 years

Would this suffer from the problem that kdgregory was mentioning? That, depending on your platform default encoding, you may split a single character into two meaningless pieces?
zebeurton about 15 years

Please don't use "iso-8859-1". Use "utf8". UTF8 handles pretty much all of iso-8859-1 in a single byte, but can scale up to handle all characters. Yes, unknown, this could split a single character into two meaningless pieces...or throw them away, which is what iso-8859-1 would do.
Aaron Digulla about 15 years

No, because I'm specifying the encoding "iso-8859-1" (which is Latin-1, i.e. ASCII with Umlauts). If your String contains other characters (above codepoint 256), you must use something else here but Latin-1 is usually good because it doesn't change anything.
Aaron Digulla about 15 years

Richard: My guess is that he has binary data in that String in which case iso-8859-1 is perfect (it won't change the data).
Aaron Digulla about 15 years

I improved my answer with some info about the encodings.
user54729 about 15 years

I don't have any binary data in the String. I was actually looking at java.nio.ByteBuffer. It looks promising.
Michael Borgwardt about 15 years

If he has binary data in a String, then unless it's in Base64, he has corrupted data and may as well stop right there.
Aaron Digulla about 15 years

Nope, you can read binary data into a String without problems. A String can contain any character between 0 and 0xffff which covers all binary codes (0-255). Often, a string is more user friendly than a byte[] array. You just need to be a bit careful when you read/write it :)
Michael Borgwardt about 15 years

Nope, if you do that you'll almost certainly end up corrupting your data. It's a horrible abuse that nobody who considers themselves a professional programmer should ever contemplate. Seriously, it's just a very bad idea.
eljenso about 15 years

Putting binary data in a String can get you into trouble. Reminds me of a bug (actually more a design mistake) I had at work with COMP-3 binary COBOL fields in a copybook that were returned into an EBCDIC String, that got converted into ISO-8859-1 at the destination. Result: garbage.
zebeurton about 15 years

@Aaron: I wouldn't want to leave a time bomb in my program, personally; when you finally try to put a Japanese or Chinese string in that 1024 buffer, it's going to blow up and you might not remember why. I wouldn't store binary data in a String either. A short[] if I wanted to deal with unsigned.
Aaron Digulla about 15 years

See my edit. In short: While it is generally a good idea not to mix bytes and Unicode, sometimes, you have to. For example, when decoding XML in a parser, you must read the header as bytes to determine the encoding. Conclusion: If you don't know what you're doing, it's gonna break.
Alan Moore about 15 years

And if you DO know what you're doing, the next guy to touch the code won't, and THEN it will break. This is very bad advice. People have trouble enough dealing with text; encouraging them to mix it with binary data is just plain irresponsible.
Aaron Digulla about 15 years

If every developer would understand how binary data can be handled safely, we wouldn't have this discussion. I explain how it is done correctly and safely since I've never seen that anywhere else (which is probably why most people do it the wrong way which leads to discussions like this one).
Aaron Digulla about 15 years

I understand that you are all afraid of this. Scared me as well. But things like this must be understood or we will never see the end to the errors about which you complain. Wrapping this in red tape won't improve the situation.
Aaron Digulla about 15 years

So while in the general case, it is smart to use one of the Unicode encodings, that won't help the guy who asked the question because he needs bytes. He didn't say why or what for but if he's right, my answer is correct.
Esailija about 11 years

+1. You will not corrupt anything when decoding as ISO-8859-1, you can always encode it back to get exactly the original bytes and then decode as X (EBCDIC for example, or mp3 or anything!). Why cannot more people get this? This was even the way to deal with binary data in Javascript for a long time and I guarantee it works.