Java: remove continuous segment of zeros from byte array
Solution 1
Regex is not the right tool for this job; you will need to implement it from scratch instead.
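For what it's worth, a from-scratch version is short. The following is only a minimal sketch, not code from this answer: the class name `ZeroRunStripper`, the method `stripZeroRuns`, and the `minRun` parameter are made up for illustration. It copies the input while skipping any run of zeros that is at least `minRun` bytes long.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, named for illustration only.
final class ZeroRunStripper {
    /**
     * Returns a copy of input with every run of at least minRun
     * consecutive zero bytes removed. Shorter runs are kept as-is.
     */
    static byte[] stripZeroRuns(byte[] input, int minRun) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        int i = 0;
        while (i < input.length) {
            if (input[i] != 0) {
                out.write(input[i]);
                i++;
                continue;
            }
            // Find the end of the current run of zeros.
            int runEnd = i;
            while (runEnd < input.length && input[runEnd] == 0) {
                runEnd++;
            }
            // Keep the run only if it is shorter than the threshold.
            if (runEnd - i < minRun) {
                out.write(input, i, runEnd - i);
            }
            i = runEnd;
        }
        return out.toByteArray();
    }
}
```

With the array from the question, `ZeroRunStripper.stripZeroRuns(a, 4)` keeps the single 0 and drops the run of four.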
Solution 2
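import java.nio.charset.StandardCharsets;
import java.util.Arrays;
byte[] a = {1,2,3,0,1,2,3,0,0,0,0,4};
// Decode as ISO-8859-1 so that each byte becomes exactly one char
// (the Charset constant avoids the checked UnsupportedEncodingException)
String s0 = new String(a, StandardCharsets.ISO_8859_1);
// Remove every run of four or more NUL characters
String s1 = s0.replaceAll("\\x00{4,}", "");
byte[] r = s1.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.toString(r)); // [1, 2, 3, 0, 1, 2, 3, 4]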
I used ISO-8859-1 (latin1) because, unlike any other encoding, every byte in the range 0x00..0xFF maps to a valid character, and each of those characters has the same numeric value as its latin1 encoding. That means the string is the same length as the original byte array, you can match any byte by its numeric value with the \xFF construct, and you can convert the resulting string back to a byte array without losing information.
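The round-trip property is easy to check for yourself. Here is a small sketch that does so; the class `Latin1RoundTrip` and its methods are hypothetical names, not part of any library. It decodes all 256 byte values, verifies that each char has the same numeric value as its byte, and encodes them back. The same round trip through UTF-8 is included for contrast.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
final class Latin1RoundTrip {
    /** Decodes the bytes with the given charset, then encodes the result again. */
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    /** Returns an array holding every byte value 0x00..0xFF once. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    /** True if latin1 gives one char per byte, each equal to the unsigned byte value. */
    static boolean isIdentityMapping() {
        byte[] all = allByteValues();
        String s = new String(all, StandardCharsets.ISO_8859_1);
        if (s.length() != all.length) {
            return false;
        }
        for (int i = 0; i < all.length; i++) {
            if (s.charAt(i) != (all[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }
}
```

`roundTrip(allByteValues(), StandardCharsets.ISO_8859_1)` returns an array equal to its input, while the same call with `StandardCharsets.UTF_8` does not, because bytes that are not valid UTF-8 get replaced during decoding.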
I wouldn't try to display the data while it's in string form--although all the characters are valid, many of them are not printable. Also, avoid manipulating the data while it's in string form; you might accidentally do some escape-sequence substitutions or another encoding conversion without realizing it. In fact, I wouldn't recommend doing this kind of thing at all, but that isn't what you asked. :)
Also, be aware that this technique won't necessarily work in other programming languages or regex flavors. You would have to test each one individually.
Solution 3
Though I question whether regex is the right tool for the job, if you do want to use one I'd suggest you just implement a CharSequence wrapper around a byte array. Something like this (I wrote it directly here without compiling it, but you get the idea):
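import java.nio.charset.StandardCharsets;

public class ByteChars implements CharSequence {
    // Fields implied by the methods below
    private final byte[] bytes;
    private final int strOfs; // start offset into bytes (inclusive)
    private final int endOfs; // end offset into bytes (exclusive)

    ByteChars(byte[] arr) {
        this(arr, 0, arr.length);
    }

    ByteChars(byte[] arr, int str, int end) {
        // check str and end are within range here
        strOfs = str;
        endOfs = end;
        bytes = arr;
    }

    public char charAt(int idx) {
        // check idx is within range here
        // mask with 0xFF so negative bytes map to chars 0x80..0xFF
        return (char) (bytes[strOfs + idx] & 0xFF);
    }

    public int length() {
        return endOfs - strOfs;
    }

    public CharSequence subSequence(int str, int end) {
        // check str and end are within range here
        return new ByteChars(bytes, strOfs + str, strOfs + end);
    }

    public String toString() {
        return new String(bytes, strOfs, endOfs - strOfs, StandardCharsets.ISO_8859_1);
    }
}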
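To show how such a wrapper could be used for the original problem, here is a hedged sketch. `ByteRegexDemo`, `ByteView`, and `removeZeroRuns` are names invented for this example; `ByteView` is a compact stand-in for the wrapper above so that the block compiles on its own. Because each char corresponds to exactly one byte, the match offsets reported by `Matcher` can be used directly as byte offsets, and the array is never copied into a String.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for illustration only.
final class ByteRegexDemo {
    /** Compact stand-in for a byte-array CharSequence wrapper: one char per byte. */
    static final class ByteView implements CharSequence {
        private final byte[] bytes;
        private final int start;
        private final int end;

        ByteView(byte[] bytes, int start, int end) {
            this.bytes = bytes;
            this.start = start;
            this.end = end;
        }

        @Override
        public char charAt(int idx) {
            return (char) (bytes[start + idx] & 0xFF);
        }

        @Override
        public int length() {
            return end - start;
        }

        @Override
        public CharSequence subSequence(int s, int e) {
            return new ByteView(bytes, start + s, start + e);
        }

        @Override
        public String toString() {
            return new String(bytes, start, end - start, StandardCharsets.ISO_8859_1);
        }
    }

    // Runs of four or more zero bytes, as in the question.
    private static final Pattern ZERO_RUN = Pattern.compile("\\x00{4,}");

    /** Copies input, skipping every region matched by ZERO_RUN. */
    static byte[] removeZeroRuns(byte[] input) {
        Matcher m = ZERO_RUN.matcher(new ByteView(input, 0, input.length));
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        int copiedUpTo = 0;
        while (m.find()) {
            // Copy the bytes between the previous match and this one.
            out.write(input, copiedUpTo, m.start() - copiedUpTo);
            copiedUpTo = m.end();
        }
        // Copy whatever follows the last match.
        out.write(input, copiedUpTo, input.length - copiedUpTo);
        return out.toByteArray();
    }
}
```

Copying the unmatched regions by hand avoids `Matcher.replaceAll`, which would build a String and require converting back to bytes.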
Solution 4
I don't see how regex would be useful to do what you want. One thing you can do is use run-length encoding to encode the byte array, replace every occurrence of "30" (read: three 0's) with the empty string, and decode the resulting string. Wikipedia has a simple Java implementation of it.
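As a rough sketch of that idea, the version below keeps the runs as {value, count} pairs rather than as an encoded string, so dropping a run is a list operation instead of a string replace. That is a substitution on my part, not the Wikipedia implementation, and the names `RunLengthFilter`, `encode`, `decode`, and `stripZeroRuns` are made up for illustration. The `minRun` threshold is a parameter, so it works for runs of three, four, or any other length.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named for illustration only.
final class RunLengthFilter {
    /** Encodes input as a list of {value, count} pairs, one pair per run. */
    static List<int[]> encode(byte[] input) {
        List<int[]> runs = new ArrayList<>();
        for (byte b : input) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == b) {
                last[1]++; // extend the current run
            } else {
                runs.add(new int[] {b, 1}); // start a new run
            }
        }
        return runs;
    }

    /** Expands {value, count} pairs back into a byte array. */
    static byte[] decode(List<int[]> runs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] run : runs) {
            for (int i = 0; i < run[1]; i++) {
                out.write(run[0]); // write() keeps only the low 8 bits
            }
        }
        return out.toByteArray();
    }

    /** Drops every run of zeros with at least minRun elements. */
    static byte[] stripZeroRuns(byte[] input, int minRun) {
        List<int[]> runs = encode(input);
        runs.removeIf(run -> run[0] == 0 && run[1] >= minRun);
        return decode(runs);
    }
}
```

For the question's array, `RunLengthFilter.stripZeroRuns(a, 4)` removes the {0, 4} run and leaves the {0, 1} run in place.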
Solution 5
Although there's a reasonable ByteString library floating around, nobody that I've seen has implemented a general regexp library on them.
I recommend solving your problem directly rather than implementing a regexp library :)
If you do convert to string and back, you probably won't find any existing encoding that gives you a round trip for your 0 bytes. If that's the case, you'd have to write your own byte array <-> string converters; not worth the trouble.
Comments
-
Mike almost 2 years
For example, let's say I want to delete from the array all continuous segments of 0's longer than 3 bytes
byte a[] = {1,2,3,0,1,2,3,0,0,0,0,4};
byte r[] = magic(a);
System.out.println(r);
result
{1,2,3,0,1,2,3,4}
I want to do something like a regular expression in Java, but on a byte array instead of a String.
Is there something that can help me built-in (or is there a good third party tool), or do I need to work from scratch?
Strings are UTF-16, so converting back and forth isn't a good idea? At least it's a lot of wasted overhead ... right?
-
Vinay Sajip over 14 years
I thought the 3 0s was just an example.
-
Laurent Caillette over 9 years
That's really clever.
-
sigpwned about 9 years
I implemented this approach and it worked a treat! Obviously you have to be careful since you're not performing any charset decoding, but for things like doctype detection it's perfect.
-
try-catch-finally almost 7 years
Bad grammar, no code, either Unicode replacement question marks or counter questions. Hard to understand for people asking such X/Y questions. Downvoted until improved.