Java: remove continuous segment of zeros from byte array

15,873

Solution 1

Regex is not the right tool for this job; you will need to implement it from scratch instead.
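A minimal sketch of what "from scratch" could look like, assuming the goal is the one stated in the question (drop every run of zeros longer than 3 bytes). The class name `ZeroRunRemover`, the method `removeZeroRuns`, and the `minRun` parameter are hypothetical names chosen for illustration, not part of the original answer.

```java
import java.util.Arrays;

public class ZeroRunRemover {

    /**
     * Returns a copy of src in which every run of at least minRun
     * consecutive zero bytes has been removed. Shorter runs are kept.
     */
    public static byte[] removeZeroRuns(byte[] src, int minRun) {
        byte[] out = new byte[src.length]; // freshly allocated, so zero-filled
        int outLen = 0;
        int i = 0;
        while (i < src.length) {
            if (src[i] != 0) {
                out[outLen++] = src[i++];
                continue;
            }
            // Find the end of the current run of zeros.
            int runEnd = i;
            while (runEnd < src.length && src[runEnd] == 0) {
                runEnd++;
            }
            int runLen = runEnd - i;
            if (runLen < minRun) {
                // Short run: keep it. The buffer already holds zeros here,
                // so advancing the output length is enough.
                outLen += runLen;
            }
            i = runEnd;
        }
        return Arrays.copyOf(out, outLen);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(removeZeroRuns(a, 4)));
        // prints [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```

This is a single pass over the input with one temporary buffer, so it avoids any byte-to-char conversion.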

Solution 2

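import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] a = {1,2,3,0,1,2,3,0,0,0,0,4};
// Decode as latin1 so each byte becomes the char with the same numeric value.
String s0 = new String(a, StandardCharsets.ISO_8859_1);
// Remove every run of four or more NUL characters.
String s1 = s0.replaceAll("\\x00{4,}", "");
byte[] r = s1.getBytes(StandardCharsets.ISO_8859_1);

System.out.println(Arrays.toString(r)); // [1, 2, 3, 0, 1, 2, 3, 4]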

I used ISO-8859-1 (latin1) because, unlike any other encoding,

  • every byte in the range 0x00..0xFF maps to a valid character, and

  • each of those characters has the same numeric value as its latin1 encoding.

That means the string is the same length as the original byte array, you can match any byte by its numeric value with the \xFF construct, and you can convert the resulting string back to a byte array without losing information.
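The two properties above can be checked directly. The following is a small sketch, not part of the original answer; the class name `Latin1RoundTrip` and its method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    /** Decodes the bytes as ISO-8859-1 and encodes the string back to bytes. */
    public static byte[] roundTrip(byte[] data) {
        String s = new String(data, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** True if char i of the decoded string has numeric value i for all 256 byte values. */
    public static boolean charsMatchByteValues() {
        byte[] all = allByteValues();
        String s = new String(all, StandardCharsets.ISO_8859_1);
        if (s.length() != all.length) {
            return false;
        }
        for (int i = 0; i < 256; i++) {
            if (s.charAt(i) != i) {
                return false;
            }
        }
        return true;
    }

    /** Builds an array containing every byte value 0x00..0xFF once. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(Arrays.equals(all, roundTrip(all))); // prints true
        System.out.println(charsMatchByteValues());             // prints true
    }
}
```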

I wouldn't try to display the data while it's in string form--although all the characters are valid, many of them are not printable. Also, avoid manipulating the data while it's in string form; you might accidentally do some escape-sequence substitutions or another encoding conversion without realizing it. In fact, I wouldn't recommend doing this kind of thing at all, but that isn't what you asked. :)

Also, be aware that this technique won't necessarily work in other programming languages or regex flavors. You would have to test each one individually.

Solution 3

Though I question whether regex is the right tool for the job, if you do want to use one I'd suggest implementing a CharSequence wrapper around a byte array. Something like this (I wrote this directly here without compiling it, but you get the idea).

public class ByteChars 
implements CharSequence

...

ByteChars(byte[] arr) {
    this(arr,0,arr.length);
    }

ByteChars(byte[] arr, int str, int end) {
    //check str and end are within range here
    strOfs=str;
    endOfs=end;
    bytes=arr;
    }

public char charAt(int idx) { 
    //check idx is within range here
    return (char)(bytes[strOfs+idx]&0xFF); 
    }

public int length() { 
    return (endOfs-strOfs); 
    }

public CharSequence subSequence(int str, int end) { 
    //check str and end are within range here
    return new ByteChars(bytes, strOfs+str, strOfs+end); 
    }

public String toString() { 
    return new String(bytes,strOfs,(endOfs-strOfs),java.nio.charset.StandardCharsets.ISO_8859_1);
    }
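For reference, here is one way the idea above might be completed and used with java.util.regex. This is a sketch under assumptions: the fields are inferred from the names used in the snippet, and the wrapper class `ByteCharsDemo` and the helper `removeMatches` are hypothetical additions, not part of the original answer. It copies the unmatched byte ranges straight from the source array, so no String is created for the data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ByteCharsDemo {

    /** Read-only CharSequence view of a byte array; each byte appears as a char in 0..255. */
    static final class ByteChars implements CharSequence {
        private final byte[] bytes;
        private final int strOfs;
        private final int endOfs;

        ByteChars(byte[] arr) {
            this(arr, 0, arr.length);
        }

        ByteChars(byte[] arr, int str, int end) {
            if (str < 0 || end > arr.length || str > end) {
                throw new IndexOutOfBoundsException("bad range " + str + ".." + end);
            }
            bytes = arr;
            strOfs = str;
            endOfs = end;
        }

        @Override
        public char charAt(int idx) {
            if (idx < 0 || idx >= length()) {
                throw new IndexOutOfBoundsException("bad index " + idx);
            }
            return (char) (bytes[strOfs + idx] & 0xFF);
        }

        @Override
        public int length() {
            return endOfs - strOfs;
        }

        @Override
        public CharSequence subSequence(int str, int end) {
            if (str < 0 || end > length() || str > end) {
                throw new IndexOutOfBoundsException("bad range " + str + ".." + end);
            }
            return new ByteChars(bytes, strOfs + str, strOfs + end);
        }

        @Override
        public String toString() {
            return new String(bytes, strOfs, endOfs - strOfs, StandardCharsets.ISO_8859_1);
        }
    }

    /** Returns a copy of src with every match of the pattern removed. */
    public static byte[] removeMatches(byte[] src, Pattern pattern) {
        Matcher m = pattern.matcher(new ByteChars(src));
        ByteArrayOutputStream out = new ByteArrayOutputStream(src.length);
        int last = 0;
        while (m.find()) {
            out.write(src, last, m.start() - last); // bytes between matches
            last = m.end();
        }
        out.write(src, last, src.length - last);    // tail after the last match
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        byte[] r = removeMatches(a, Pattern.compile("\\x00{4,}"));
        System.out.println(Arrays.toString(r)); // prints [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```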

Solution 4

I don't see how regex would be useful for what you want. One thing you can do is use run-length encoding to encode the byte array, replace every occurrence of "30" (read: three 0's) with the empty string, and decode the result. Wikipedia has a simple Java implementation of it.
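A rough sketch of the run-length idea, assuming runs are held as (count, value) pairs rather than as the textual encoding the answer describes, and assuming a configurable threshold (4 here, to match the "longer than 3 bytes" wording of the question). The class `RleZeroFilter` and its methods are hypothetical names for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RleZeroFilter {

    /** Encodes src as a list of runs; each run is {count, value}. */
    public static List<int[]> encode(byte[] src) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < src.length) {
            int j = i;
            while (j < src.length && src[j] == src[i]) {
                j++;
            }
            runs.add(new int[] {j - i, src[i]});
            i = j;
        }
        return runs;
    }

    /** Expands a list of {count, value} runs back into a byte array. */
    public static byte[] decode(List<int[]> runs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] run : runs) {
            for (int k = 0; k < run[0]; k++) {
                out.write(run[1]); // write(int) stores only the low 8 bits
            }
        }
        return out.toByteArray();
    }

    /** Encodes, drops every zero run of at least minRun bytes, and decodes again. */
    public static byte[] dropZeroRuns(byte[] src, int minRun) {
        List<int[]> runs = encode(src);
        runs.removeIf(run -> run[1] == 0 && run[0] >= minRun);
        return decode(runs);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(dropZeroRuns(a, 4)));
        // prints [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```

Filtering on the run list sidesteps the ambiguity of searching a textual encoding, where "30" could also appear as part of a longer count such as "130".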

Solution 5

Although there's a reasonable ByteString library floating around, nobody that I've seen has implemented a general regexp library on them.

I recommend solving your problem directly rather than implementing a regexp library :)

If you do convert to string and back, you probably won't find any existing encoding that gives you a round trip for your 0 bytes. If that's the case, you'd have to write your own byte array <-> string converters; not worth the trouble.

Author: Mike

I hate computers

Updated on June 19, 2022

Comments

  • Mike
    Mike almost 2 years

    For example, let's say I want to delete from the array all continuous segments of 0's longer than 3 bytes

    byte a[] = {1,2,3,0,1,2,3,0,0,0,0,4};
    byte r[] = magic(a);
    System.out.println(r);
    

    result

    {1,2,3,0,1,2,3,4}
    

    I want to do something like a regular expression in Java, but on a byte array instead of a String.

    Is there something that can help me built-in (or is there a good third party tool), or do I need to work from scratch?

    Strings are UTF-16, so converting back and forth isn't a good idea? At least it's a lot of wasted overhead ... right?

  • Vinay Sajip
    Vinay Sajip over 14 years
    I thought the 3 0s was just an example.
  • Laurent Caillette
    Laurent Caillette over 9 years
    That's really clever.
  • sigpwned
    sigpwned about 9 years
    I implemented this approach and it worked a treat! Obviously you have to be careful since you're not performing any charset decoding, but for things like doctype detection it's perfect.
  • try-catch-finally
    try-catch-finally almost 7 years
    Bad grammar, no code, either Unicode replacement question marks or counter questions. Hard to understand for people asking such X/Y questions. Downvoted until improved.