Binary run length encoding

15,403

Solution 1

Since you're coding bits, you probably want to use a bit-based RLE instead of a byte-based one. In this context, you should consider Elias gamma coding (or some variant thereof) to efficiently encode your run lengths.

A reasonable first approximation for your encoding format might be:

  • first bit = same as the first bit of the uncompressed string (to set initial polarity)
  • remaining bits: Elias coded lengths of successive bit runs (alternating 1 and 0)

Since you know how many bits are in your uncompressed string, you don't need a termination code; you can just add any necessary binary padding as arbitrary bits.

Note that it is always possible for the run-length "compression" to expand your bit string; if you're concerned about this, you can add another initial bit to indicate whether your data is in compressed or uncompressed format, limiting your compression overhead to 1 bit.

Solution 2

264 bit, that are just 33 byte, and that are in base64 just 44 byte. I think this (very small) amount of information is hardly compressable. The sparse representation nulvinge refers too just stores the non zero elements and their values (as you have just 0/1), i.e. in your case just the index of the non zero bits. But as you have 264 possible bits - you need 9 bits for the index, which means, in case you have more than 29 non zero entries, you need already more than original.

Maybe your question is formulated wrong, but I dont see how 264 bits can lead to an intimidating base64 string (How do you generate it - maybe you translate not the 264 bits, but 264 ASCIIs chars (with the value 0 and 1) - that would explain your long result string?).

Share:
15,403
avramov
Author by

avramov

:-)

Updated on June 19, 2022

Comments

  • avramov
    avramov almost 2 years

    I have a web form, for the contents of which I would like to generate a short representation in Base64. The form, among other things, contains a list of 264 binary values, the greater part of which are going to be 0 at any single time. (They represent regions on a geographical map). Even in Base64, this 264-bit number generates a long, intimidating string. I want to implement run-length encoding, as efficiently as possible. Can you help me with this? I've googled binary RLE, but have found nothing of use.

    What I've tried this far - running RLE on the binary string using decimal counts and "A" as a separator denoting a change between 0 and 1, then converting the result from base 11 to base 64. For example:

    00000000001111111000000010000000000000000000000001111111110001111010101000000000000000000000000000000000000111111111110111000000000000111111100000001000000000000000000000000111111111000111101010100000000000000000000000000000000000011111111111011100
    

    becomes

    10A5A5AA22A7A1A2AAAAAAA34A9AA1A10A5A5AA22A7A1A2AAAAAAA34A9AA1A
    

    which in turn becomes

    CNnbr/FxkgbbOw0LNAKgk65P8SdvaTG+t74o
    

    or, in base 62,

    6imo7zq1pqr2mqglTHzXwJRAksm7fvHZHWQK
    

    It's better, but I still can't help but doubt if I'm doing something wrong - is using the digit "A" as a separator is the best way to do this?

    And another update:

    Thanks to @comingstorm, I have shortened the compressed string some more.

    ILHHASCAASBYwwccDASYgAEgWDI=
    

    As I mentioned it in the comments, real usage cases would generally result in an even shorter string.

  • flolo
    flolo over 12 years
    @egasimus: you are right, I calculated with word size. But the general statement holds even for 44 bytes: too small to compress it well.
  • avramov
    avramov over 12 years
    Well, I've managed to shave off a few letters, see above. The binary number above is actually quite an unlikely piece of input data. It would get shorter in more realistic cases. I'm not sure about using the "A" though.