How to compress URL parameters

38,605

Solution 1

A working solution putting various bits of good (or so I think) ideas together

I did this for fun, mainly because it gave me an opportunity to implement an Huffman encoder in PHP and I could not find a satisfactory existing implementation.

However, this might save you some time if you plan to explore a similar path.

Burrows-Wheeler+move-to-front+Huffman transform

I'm not quite sure BWT would be best suited for your kind of input.
This is no regular text, so recurring patterns would probably not occur as often as in source code or plain English.

Besides, a dynamic Huffman code would have to be passed along with the encoded data which, for very short input strings, would harm the compression gain badly.

I might well be wrong, in which case I would gladly see someone prove me to be.

Anyway, I decided to try another approach.

General principle

1) define a structure for your URL parameters and strip the constant part

for instance, starting from:

repos=aaa,bbb,ccc&
labels=ddd,eee,fff&
milestones=ggg,hhh,iii&
username=kkk&
show_open=0&
show_closed=1&
show_commented=1&
show_uncommented=0

extract:

aaa,bbb,ccc|ddd,eee,fff|ggg,hhh,iii|kkk|0110

where , and | act as string and/or field terminators, while boolean values don't need any.

2) define a static repartition of symbols based on the expected average input and derive a static Huffman code

Since transmitting a dynamic table would take more space than your initial string, I think the only way to achhieve any compression at all is to have a static huffman table.

However, you can use the structure of your data to your advantage to compute reasonable probabilities.

You can start with the repartition of letters in English or other languages and throw in a certain percentage of numbers and other punctuation signs.

Testing with a dynamic Huffman coding, I saw compression rates of 30 to 50%.

This means with a static table you can expect maybe a .6 compression factor (reducing the lenght of your data by 1/3), not much more.

3) convert this binary Huffmann code into something an URI can handle

The 70 regular ASCII 7 bits chars in that list

!'()*-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

would give you an expansion factor of about 30%, practically no better than a base64 encode.

A 30% expansion would ruin the gain from a static Huffman compression, so this is hardly an option!

However, since you control the encoding client and server side, you can use about anything that is not an URI reserved character.

An interesting possiblity would be to complete the above set up to 256 with whatever unicode glyphs, which would allow to encode your binary data with the same number of URI-compliant characters, thus replacing a painful and slow bunch of long integer divisions with a lightning fast table lookup.

Structure description

The codec is meant to be used both client and server side, so it is essential that server and clients share a common data structure definition.

Since the interface is likely to evolve, it seems wise to store a version number for upward compatibility.

The interface definition will use a very minimalistic description language, like so:

v   1               // version number (between 0 and 63)
a   en              // alphabet used (English)
o   10              // 10% of digits and other punctuation characters
f   1               // 1% of uncompressed "foreign" characters
s 15:3 repos        // list of expeced 3 strings of average length 15
s 10:3 labels
s 8:3  milestones
s 10   username     // single string of average length 10
b      show_open    // boolean value
b      show_closed
b      show_commented
b      show_uncommented

Each language supported will have a frequency table for all its used letters

digits and other computerish symbols like -, . or _ will have a global frequency, regardless of languages

separators (, and |) frequencies will be computed according to the number of lists and fields present in the structure.

All other "foreign" characters will be escaped with a specific code and encoded as plain UTF-8.

Implementation

The bidirectional conversion path is as follows:

list of fields <-> UTF-8 data stream <-> huffman codes <-> URI

Here is the main codec

include ('class.huffman.codec.php');
class IRI_prm_codec
{
    // available characters for IRI translation
    static private $translator = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅ";

    const VERSION_LEN = 6; // version number between 0 and 63

    // ========================================================================
    // constructs an encoder
    // ========================================================================
    public function __construct ($config)
    {
        $num_record_terminators = 0;
        $num_record_separators = 0;
        $num_text_sym = 0;

        // parse config file
        $lines = file($config, FILE_IGNORE_NEW_LINES|FILE_SKIP_EMPTY_LINES);
        foreach ($lines as $line)
        {
            list ($code, $val) = preg_split('/\s+/', $line, 2);
            switch ($code)
            {
            case 'v': $this->version = intval($val); break;
            case 'a': $alphabet = $val; break;
            case 'o': $percent_others = $val; break;
            case 'f': $percent_foreign = $val; break;
            case 'b':
                $this->type[$val] = 'b';
                break;
            case 's':
                list ($val, $field) = preg_split('/\s+/u', $val, 2);
                @list ($len,$num) = explode (':', $val);
                if (!$num) $num=1;
                $this->type[$field] = 's';
                $num_record_terminators++;
                $num_record_separators+=$num-1;
                $num_text_sym += $num*$len;
                break;

            default: throw new Exception ("Invalid config parameter $code");
            }
        }

        // compute symbol frequencies           
        $total = $num_record_terminators + $num_record_separators + $num_text_sym + 1;

        $num_chars = $num_text_sym * (100-($percent_others+$percent_foreign))/100;
        $num_sym = $num_text_sym * $percent_others/100;
        $num_foreign = $num_text_sym * $percent_foreign/100;

        $this->get_frequencies ($alphabet, $num_chars/$total);
        $this->set_frequencies (" .-_0123456789", $num_sym/$total);
        $this->set_frequencies ("|", $num_record_terminators/$total);
        $this->set_frequencies (",", $num_record_separators/$total);
        $this->set_frequencies ("\1", $num_foreign/$total);
        $this->set_frequencies ("\0", 1/$total);

        // create Huffman codec
        $this->huffman = new Huffman_codec();
        $this->huffman->make_code ($this->frequency);
    }

    // ------------------------------------------------------------------------
    // grab letter frequencies for a given language
    // ------------------------------------------------------------------------
    private function get_frequencies ($lang, $coef)
    {
        $coef /= 100;
        $frequs = file("$lang.dat", FILE_IGNORE_NEW_LINES|FILE_SKIP_EMPTY_LINES);
        foreach ($frequs as $line)
        {
            $vals = explode (" ", $line);
            $this->frequency[$vals[0]] = floatval ($vals[1]) * $coef;
        }
    }

    // ------------------------------------------------------------------------
    // set a given frequency for a group of symbols
    // ------------------------------------------------------------------------
    private function set_frequencies ($symbols, $coef)
    {
        $coef /= strlen ($symbols);
        for ($i = 0 ; $i != strlen($symbols) ; $i++) $this->frequency[$symbols[$i]] = $coef;
    }

    // ========================================================================
    // encodes a parameter block
    // ========================================================================
    public function encode($input)
    {
        // get back input values
        $bools = '';
        foreach (get_object_vars($input) as $prop => $val)
        {
            if (!isset ($this->type[$prop])) throw new Exception ("unknown property $prop");
            switch ($this->type[$prop])
            {
            case 'b': $bools .= $val ? '1' : '0'; break;
            case 's': $strings[] = $val; break;
            default: throw new Exception ("Uh oh... type ".$this->type[$prop]." not handled ?!?");
            }
        }

        // set version number and boolean values in front
        $prefix = sprintf ("%0".self::VERSION_LEN."b$bools", $this->version);

        // pass strings to our Huffman encoder
        $strings = implode ("|", $strings);
        $huff = $this->huffman->encode ($strings, $prefix, "UTF-8");

        // translate into IRI characters
        mb_internal_encoding("UTF-8");
        $res = '';
        for ($i = 0 ; $i != strlen($huff) ; $i++) $res .= mb_substr (self::$translator, ord($huff[$i]), 1);

        // done
        return $res;
    }

    // ========================================================================
    // decodes an IRI string into a lambda object
    // ========================================================================
    public function decode($input)
    {
        // convert IRI characters to binary
        mb_internal_encoding("UTF-8");
        $raw = '';
        $len = mb_strlen ($input);
        for ($i = 0 ; $i != $len ; $i++)
        {
            $c = mb_substr ($input, 0, 1);
            $input = mb_substr ($input, 1);
            $raw .= chr(mb_strpos (self::$translator, $c));
        }

        $this->bin = '';        

        // check version
        $version = $this->read_bits ($raw, self::VERSION_LEN);
        if ($version != $this->version) throw new Exception ("Version mismatch: expected {$this->version}, found $version");

        // read booleans
        foreach ($this->type as $field => $type)
            if ($type == 'b')
                $res->$field = $this->read_bits ($raw, 1) != 0;

        // decode strings
        $strings = explode ('|', $this->huffman->decode ($raw, $this->bin));
        $i = 0;
        foreach ($this->type as $field => $type) 
            if ($type == 's')
                $res->$field = $strings[$i++];

        // done
        return $res;
    }

    // ------------------------------------------------------------------------
    // reads raw bit blocks from a binary string
    // ------------------------------------------------------------------------
    private function read_bits (&$raw, $len)
    {
        while (strlen($this->bin) < $len)
        {
            if ($raw == '') throw new Exception ("premature end of input"); 
            $this->bin .= sprintf ("%08b", ord($raw[0]));
            $raw = substr($raw, 1);
        }
        $res = bindec (substr($this->bin, 0, $len));
        $this->bin = substr ($this->bin, $len);
        return $res;
    }
}

The underlying Huffman codec

include ('class.huffman.dict.php');

class Huffman_codec
{
    public  $dict = null;

    // ========================================================================
    // encodes a string in a given string encoding (default: UTF-8)
    // ========================================================================
    public function encode($input, $prefix='', $encoding="UTF-8")
    {
        mb_internal_encoding($encoding);
        $bin = $prefix;
        $res = '';
        $input .= "\0";
        $len = mb_strlen ($input);
        while ($len--)
        {
            // get next input character
            $c = mb_substr ($input, 0, 1);
            $input = substr($input, strlen($c)); // avoid playing Schlemiel the painter

            // check for foreign characters
            if (isset($this->dict->code[$c]))
            {
                // output huffman code
                $bin .= $this->dict->code[$c];
            }
            else // foreign character
            {
                // escape sequence
                $lc = strlen($c);
                $bin .= $this->dict->code["\1"] 
                     . sprintf("%02b", $lc-1); // character length (1 to 4)

                // output plain character
                for ($i=0 ; $i != $lc ; $i++) $bin .= sprintf("%08b", ord($c[$i]));
            }

            // convert code to binary
            while (strlen($bin) >= 8)
            {
                $res .= chr(bindec(substr ($bin, 0, 8)));
                $bin = substr($bin, 8);
            }
        }

        // output last byte if needed
        if (strlen($bin) > 0)
        {
            $bin .= str_repeat ('0', 8-strlen($bin));
            $res .= chr(bindec($bin));
        }

        // done
        return $res;
    }

    // ========================================================================
    // decodes a string (will be in the string encoding used during encoding)
    // ========================================================================
    public function decode($input, $prefix='')
    {
        $bin = $prefix;
        $res = '';
        $len = strlen($input);
        for ($i=0 ;;)
        {
            $c = $this->dict->symbol($bin);

            switch ((string)$c)
            {
            case "\0": // end of input
                break 2;

            case "\1": // plain character

                // get char byte size
                if (strlen($bin) < 2)
                {
                    if ($i == $len) throw new Exception ("incomplete escape sequence"); 
                    $bin .= sprintf ("%08b", ord($input[$i++]));
                }
                $lc = 1 + bindec(substr($bin,0,2));
                $bin = substr($bin,2);
                // get char bytes
                while ($lc--)
                {
                    if ($i == $len) throw new Exception ("incomplete escape sequence"); 
                    $bin .= sprintf ("%08b", ord($input[$i++]));
                    $res .= chr(bindec(substr($bin, 0, 8)));
                    $bin = substr ($bin, 8);
                }
                break;

            case null: // not enough bits do decode further

                // get more input
                if ($i == $len) throw new Exception ("no end of input mark found"); 
                $bin .= sprintf ("%08b", ord($input[$i++]));
                break;

            default:  // huffman encoded

                $res .= $c;
                break;          
            }
        }

        if (bindec ($bin) != 0) throw new Exception ("trailing bits in input");
        return $res;
    }

    // ========================================================================
    // builds a huffman code from an input string or frequency table
    // ========================================================================
    public function make_code ($input, $encoding="UTF-8")
    {
        if (is_string ($input))
        {
            // make dynamic table from the input message
            mb_internal_encoding($encoding);
            $frequency = array();
            while ($input != '')
            {
                $c = mb_substr ($input, 0, 1);
                $input = mb_substr ($input, 1);
                if (isset ($frequency[$c])) $frequency[$c]++; else $frequency[$c]=1;
            }
            $this->dict = new Huffman_dict ($frequency);
        }
        else // assume $input is an array of symbol-indexed frequencies
        {
            $this->dict = new Huffman_dict ($input);
        }
    }
}

And the huffman dictionary

class Huffman_dict
{
    public  $code = array();

    // ========================================================================
    // constructs a dictionnary from an array of frequencies indexed by symbols
    // ========================================================================
    public function __construct ($frequency = array())
    {
        // add terminator and escape symbols
        if (!isset ($frequency["\0"])) $frequency["\0"] = 1e-100;
        if (!isset ($frequency["\1"])) $frequency["\1"] = 1e-100;

        // sort symbols by increasing frequencies
        asort ($frequency);

        // create an initial array of (frequency, symbol) pairs
        foreach ($frequency as $symbol => $frequence) $occurences[] = array ($frequence, $symbol);

        while (count($occurences) > 1)
        {
            $leaf1 = array_shift($occurences);
            $leaf2 = array_shift($occurences);
            $occurences[] = array($leaf1[0] + $leaf2[0], array($leaf1, $leaf2));
            sort($occurences);
        }
        $this->tree = $this->build($occurences[0], '');

    }

    // -----------------------------------------------------------
    // recursive build of lookup tree and symbol[code] table
    // -----------------------------------------------------------
    private function build ($node, $prefix)
    {
        if (is_array($node[1]))
        {
            return array (
                '0' => $this->build ($node[1][0], $prefix.'0'),
                '1' => $this->build ($node[1][1], $prefix.'1'));
        }
        else
        {
            $this->code[$node[1]] = $prefix;
            return $node[1];
        }
    }

    // ===========================================================
    // extracts a symbol from a code stream
    // if found     : updates code stream and returns symbol
    // if not found : returns null and leave stream intact
    // ===========================================================
    public function symbol(&$code_stream)
    {
        list ($symbol, $code) = $this->get_symbol ($this->tree, $code_stream);
        if ($symbol !== null) $code_stream = $code;
        return $symbol;
    }

    // -----------------------------------------------------------
    // recursive search for a symbol from an huffman code
    // -----------------------------------------------------------
    private function get_symbol ($node, $code)
    {
        if (is_array($node))
        {
            if ($code == '') return null;
            return $this->get_symbol ($node[$code[0]], substr($code, 1));
        }
        return array ($node, $code);
    }
}

Example

include ('class.iriprm.codec.php');

$iri = new IRI_prm_codec ("config.txt");
foreach (array (
    'repos' => "discussion,documentation,hoodie-cli",
    'labels' => "enhancement,release-0.3.0,starter",
    'milestones' => "1.0.0,1.1.0,v0.7",
    'username' => "mklappstuhl",
    'show_open' => false,
    'show_closed' => true,
    'show_commented' => true,
    'show_uncommented' => false
) as $prop => $val) $iri_prm->$prop = $val;

$encoded = $iri->encode ($iri_prm);
echo "encoded as $encoded\n";
$decoded = $iri->decode ($encoded);
var_dump($decoded);

output:

encoded as 5ĶůťÊĕCOĔƀŪļŤłmĄZEÇŽÉįóšüÿjħũÅìÇēOĪäŖÏŅíŻÉĒQmìFOyäŖĞqæŠŹōÍĘÆŤŅËĦ

object(stdClass)#7 (8) {
  ["show_open"]=>
  bool(false)
  ["show_closed"]=>
  bool(true)
  ["show_commented"]=>
  bool(true)
  ["show_uncommented"]=>
  bool(false)
  ["repos"]=>
  string(35) "discussion,documentation,hoodie-cli"
  ["labels"]=>
  string(33) "enhancement,release-0.3.0,starter"
  ["milestones"]=>
  string(16) "1.0.0,1.1.0,v0.7"
  ["username"]=>
  string(11) "mklappstuhl"
}

In that example, the input got packed into 64 unicode characters, for an input length of about 100, yielding a 1/3 reduction.

An equivalent string:

discussion,documentation,hoodie-cli|enhancement,release-0.3.0,starter|
1.0.0,1.1.0,v0.7|mklappstuhl|0110

Would be compressed by a dynamic Huffman table to 59 characters. Not much of a difference.

No doubt smart data reordering would reduce that, but then you would need to pass the dynamic table along...

Chinese to the rescue?

Drawing on ttepasse's idea, one could take advantage of the huge number of Asian characters to find a range of 0x4000 (12 bits) contiguous values, to code 3 bytes into 2 CJK characters, like so:

    // translate into IRI characters
    $res = '';
    $len = strlen ($huff);
    for ($i = 0 ; $i != $len ; $i++)
    {
        $byte = ord($huff[$i]);
        $quartet[2*$i  ] = $byte >> 4;
        $quartet[2*$i+1] = $byte &0xF;
    }
    $len *= 2;
    while ($len%3 != 0) $quartet[$len++] = 0;
    $len /= 3;
    for ($i = 0 ; $i != $len ; $i++)
    {
        $utf16 = 0x4E00 // CJK page base, enough range for 2**12 (0x4000) values
               + ($quartet[3*$i+0] << 8)
               + ($quartet[3*$i+1] << 4)
               + ($quartet[3*$i+2] << 0);
        $c = chr ($utf16 >> 8) . chr ($utf16 & 0xFF);
        $res .= $c;
    }
    $res = mb_convert_encoding ($res, "UTF-8", "UTF-16");

and back:

    // convert IRI characters to binary
    $input = mb_convert_encoding ($input, "UTF-16", "UTF-8");
    $len = strlen ($input)/2;
    for ($i = 0 ; $i != $len ; $i++)
    {
        $val = (ord($input[2*$i  ]) << 8) + ord ($input[2*$i+1]) - 0x4E00;
        $quartet[3*$i+0] = ($val >> 8) &0xF;
        $quartet[3*$i+1] = ($val >> 4) &0xF;
        $quartet[3*$i+2] = ($val >> 0) &0xF;
    }
    $len *= 3;
    while ($len %2) $quartet[$len++] = 0;
    $len /= 2;
    $raw = '';
    for ($i = 0 ; $i != $len ; $i++)
    {
        $raw .= chr (($quartet[2*$i+0] << 4) + $quartet[2*$i+1]);
    }

The previous output of 64 Latin chars

5ĶůťÊĕCOĔƀŪļŤłmĄZEÇŽÉįóšüÿjħũÅìÇēOĪäŖÏŅíŻÉĒQmìFOyäŖĞqæŠŹōÍĘÆŤŅËĦ

would "shrink" to 42 Asian characters:

乙堽孴峴勀垧壩坸冫嚘佰嫚凲咩俇噱刵巋娜奾埵峼圔奌夑啝啯嶼勲婒婅凋凋伓傊厷侖咥匄冯塱僌

However, as you can see, the sheer bulk of your average ideogram makes the string actually longer (pixel-wise), so even if the idea was promising, the outcome is rather disappointing.

Picking thinner glyphs

On the other hand, you can try to pick "thin" characters as a base for URI encoding. For instance:

█ᑊᵄ′ӏᶟⱦᵋᵎiïᵃᶾ᛬ţᶫꞌᶩ᠇܂اlᶨᶾᛁ⁚ᵉʇȋʇίן᠙ۃῗᥣᵋĭꞌ៲ᛧ༚ƫܙ۔ˀȷˁʇʹĭ∕ٱ;łᶥյ;ᴶ⁚ĩi⁄ʈ█

instead of

█5ĶůťÊĕCOĔƀŪļŤłmĄZEÇŽÉįóšüÿjħũÅìÇēOĪäŖÏŅíŻÉĒQmìFOyäŖĞqæŠŹōÍĘÆŤŅËĦ█

That will shrink the length by half with proportional fonts, including in a browser address bar.

My best candidate set of 256 "thin" glyphs so far:

᠊།ᑊʲ་༌ᵎᵢᶤᶩᶪᶦᶧˡ ⁄∕เ'Ꞌꞌ꡶ᶥᵗᶵᶨ|¦ǀᴵ  ᐧᶠᶡ༴ˢᶳ⁏ᶴʳʴʵ։᛬⍮ʹ′ ⁚⁝ᵣ⍘༔⍿ᠵᥣᵋᵌᶟᴶǂˀˁˤ༑,.   ∙Ɩ៲᠙ᵉᵊᵓᶜᶝₑₔյⵏⵑ༝༎՛ᵞᵧᚽᛁᛂᛌᛍᛙᛧᶢᶾ৷⍳ɩΐίιϊᵼἰἱἲἳἴἵἶἷὶίῐῑῒΐῖῗ⎰⎱᠆ᶿ՝ᵟᶫᵃᵄᶻᶼₐ∫ª౹᠔/:;\ijltìíîïĩīĭįıĵĺļłţŧſƚƫƭǐǰȉȋțȴȷɉɨɪɫɬɭʇʈʝːˑ˸;·ϳіїјӏ᠇ᴉᵵᵻᶅᶖḭḯḷḹḻḽṫṭṯṱẗẛỉị⁞⎺⎻⎼⎽ⱡⱦ꞉༈ǁ‖༅༚ᵑᵝᵡᵦᵪา᠑⫶ᶞᚁᚆᚋᚐᚕᵒᵔᵕᶱₒⵗˣₓᶹๅʶˠ᛫ᵛᵥᶺᴊ

Conclusion

This implementation should be ported to JavaScript to allow client-server exchange.
You should also provide a way to share the structure and Huffman codes with the clients.

It is not difficult and rather fun to do, but that means even more work :).

The Huffman gain in term of characters is around 30%.

Of course these characters are multibyte for the most part, but if you aim for the shortest URI it does not matter.
Except for the booleans that can easily be packed to 1 bit, those pesky strings seem rather reluctant to be compressed.
It might be possible to better tune the frequencies, but I doubt you will get above 50% compression rate.

On the other hand, picking thin glyphs does actually more to shrink the string.

So all in all the combination of both might indeed achieve something, though it's a lot of work for a modest result.

Solution 2

Just as you yourself propose, I would first get rid of all the characters that are not carrying any information, because they are part of the "format".

E.g. turn "labels=open,ssl,cypher&repository=275643&username=ryanbrg&milestones=&with_comment=yes" to "open,ssl,cyper|275643|ryanbrg||yes".

Then use a Huffmann encoding with a fixed probability vector (resulting in a fixed mapping from characters to variable length bitstrings - with the most probable characters mapped to shorter bitstrings and less probable characters mapped to longer bitstrings).

You could even use different probability vectors for the different parameters. For example in the parameter "labels" the alpha characters will have high probability, but in the "repository" parameter the numeric characters will have the highest probability. If you do this, you should consider the separator "|" a part of the preceeding parameter.

And finally turn the long bitstring (which is the concatenation all the bitstrings to which the characters were mapped) into something you can put into an URL by base64url encoding it.

If you could send me a set of representative parameter lists, I could run them through a Huffmann coder to see how well they compress.

The probability vector (or equivalently the mapping from characters to bitstrings) should be encoded as constant arrays into the Javascript function that is sent to the browser.

Of course you could go even further and - for example - try to get a list of possible lables with their probabilities. Then you could map entire lables to bitstrings with a Huffmann encoding. This will give you better compression, but you will have extra work for those labels that are new (e.g. falling back to the single character encoding), and of course the mapping (which - as mentioned above - is a constant array in the Javascript function) will be much larger.

Solution 3

Why not using protocol-buffers?

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.

ProtoBuf.js converts objects to protocol buffer messages and vice vera.

The following object converts to: CgFhCgFiCgFjEgFkEgFlEgFmGgFnGgFoGgFpIgNqZ2I=

{
    repos : ['a', 'b', 'c'],
    labels: ['d', 'e', 'f'],
    milestones : ['g', 'h', 'i'],
    username : 'jgb'
}

Example

The following example is built using require.js. Give it a try on this jsfiddle.

require.config({
    paths : {
        'Math/Long'  : '//rawgithub.com/dcodeIO/Long.js/master/Long.min',
        'ByteBuffer' : '//rawgithub.com/dcodeIO/ByteBuffer.js/master/ByteBuffer.min',
        'ProtoBuf'   : '//rawgithub.com/dcodeIO/ProtoBuf.js/master/ProtoBuf.min'
    }
})

require(['message'], function(message) {
    var data = {
        repos : ['a', 'b', 'c'],
        labels: ['d', 'e', 'f'],
        milestones : ['g', 'h', 'i'],
        username : 'jgb'
    }

    var request = new message.arguments(data);

    // Convert request data to base64
    var base64String = request.toBase64();
    console.log(base64String);

    // Convert base64 back
    var decodedRequest = message.arguments.decode64(base64String);
    console.log(decodedRequest);
});

// Protobuf message definition
// Message definition could also be stored in a .proto definition file
// See: https://github.com/dcodeIO/ProtoBuf.js/wiki
define('message', ['ProtoBuf'], function(ProtoBuf) {
    var proto = {
        package : 'message',
        messages : [
            {
                name : 'arguments',
                fields : [
                    {
                        rule : 'repeated',
                        type : 'string',
                        name : 'repos',
                        id : 1
                    },
                    {
                        rule : 'repeated',
                        type : 'string',
                        name : 'labels',
                        id : 2
                    },
                    {
                        rule : 'repeated',
                        type : 'string',
                        name : 'milestones',
                        id : 3
                    },
                    {
                        rule : 'required',
                        type : 'string',
                        name : 'username',
                        id : 4
                    },
                    {
                        rule : 'optional',
                        type : 'bool',
                        name : 'with_comments',
                        id : 5
                    },
                    {
                        rule : 'optional',
                        type : 'bool',
                        name : 'without_comments',
                        id : 6
                    }
                ],
            }
        ]
    };

    return ProtoBuf.loadJson(proto).build('message')
});

Solution 4

I have a cunning plan! (And a drink of gin tonic)

You doesn't seem to care about the length of the bytestream but of the length of the resulting glyphs, e.g. what the string which is displayed to the user.

Browser are pretty good in converting an IRI to the underlying [URI][2] while still displaying the IRI in the address bar. IRIs have a greater repertoire of possible characters while your set of possible chars is rather limited.

That means you can encode bigrams of your chars (aa, ab, ac, …, zz & special chars) into one char of the full unicode spectrum. Say you've got 80 possible ASCII chars: the number of possible combinations of two chars is 6400. Which are easy findable in Unicodes assigned chars, e.g. in the han unified CJK spectrum:

aa  →  一
ab  →  丁
ac  →  丂
ad  →  七
…

I picked CJK because this is only (slighty) reasonable if the target chars are assigned in unicode and have assigned glyphs on the major browser and operating systems. For that reason the private use area is out and the more efficient version using trigrams (whose possible combinations could use all of Unicodes 1114112 possible code points) are out.

To recap: the underlying bytes are still there and – given UTF-8 encoding – possible even longer, but the string of displayed characters the user sees and copies is 50% shorter.

Ok, Ok, reasons, why this solution is insane:

  • IRIs are not perfect. A lot of lesser tools than modern browser have their problems.

  • The algorithm needs obviously a lot of more work. You'll need a function which maps the bigrams to the target chars and back. And it should preferable work arithmetically to avoid big hash tables in memory.

  • The target chars should be checked if they are assigned and if they are simple chars and not fancy unicodian things like combining chars or stuff that got lost somewhere in Unicode normalization. Also if the target area is an continuous span of assigned chars with glyphs.

  • Browser are sometimes wary of IRIs. For good reason, given the IDN homograph attacks. Are they OK with all these non-ASCII-chars in their address bar?

  • And the biggest: people are notoriously bad at remembering characters in scripts they don't know. They are even worse at trying to (re)-type these chars. And copy'n'paste can go wrong in many different clicks. There is a reason URL shorteners use Base64 and even smaller alphabets.

… speaking of which: That would be my solution. Offloading the work of shortening links either to the user or integrating goo.gl or bit.ly via their APIs.

Solution 5

Small tip: Both parseInt and Number#toString support radix arguments. Try using a radix of 36 to encode numbers (or indexes into lists) in URLs.

Share:
38,605

Related videos on Youtube

Jan Lehnardt
Author by

Jan Lehnardt

Updated on July 05, 2022

Comments

  • Jan Lehnardt
    Jan Lehnardt almost 2 years

    Say I have a single-page application that uses a third party API for content. The app’s logic is in-browser only; there is no backend I can write to.

    To allow deep-linking into the state of the app, I use pushState() to keep track of a few variables that determine the state of the app. (Note that Ubersicht’s public version doesn’t do this yet.)

    • Variables: repos, labels, milestones, username, show_open (bool), with_comments (bool), and without_comments (bool).
    • URL format: ?label=label_1,label_2,label_3&repos=repo_1….
    • Values: the usual suspects. Roughly, [a-zA-Z][a-zA-Z0-9_-], or any boolean indicator.

    So far so good.

    Now, since the query string can be a bit long and unwieldy and I would like to be able to pass around URLs like http://espy.github.io/ubersicht/?state=SOMOPAQUETOKENTHATLOSSLESSLYDECOMPRESSESINTOTHEORIGINALVALUES#hoodiehq, the shorter the better.

    My first attempt was going to be using some zlib-like algorithm for this. Then @flipzagging pointed to antirez/smaz, which looks more suitable for short strings. (JavaScript version here.)

    Since = and & are not specifically handled in the Javascript version (see line 9 of the main lib file), we might be able to tweak things a little there.

    Furthermore, there is an option for encoding the values in a fixed table. With this option, the order of arguments is pre-defined and all we need to keep track of is the actual value. Example: turn a=hamster&b=cat into 7hamster3cat (length+chars) or hamster|cat (value + |), potentially before the smaz compression.

    Is there anything else I should be looking for?

    • dandavis
      dandavis about 10 years
      packer with base62 encoding is worth a shot. i use deflate() and inflate() a lot, but you'll need to base64 the output of deflate... danml.com/js/compression.js
    • O.O
      O.O about 10 years
      @OP - Could you store the values in a cookie or isolated storage instead of the querystring?
  • Sven
    Sven about 10 years
    I think this solution, while valid in a general case, is ruled out by the restriction of "no backend to write to", no matter how sophisticated the method of communication. :)
  • Sven
    Sven about 10 years
    Bounty awarded for having a real implementation in Javascript that looks like it is shortening data and returning it back.
  • Sven
    Sven about 10 years
    Your post also deserves a bounty - I'll be back.
  • kuroi neko
    kuroi neko about 10 years
    Protocol buffers : 4000 lines of JavaScript. Require.js : 2000 lines of JavaScript. Expansion factor superior to 1. If this is a solution, what was the problem? ;)
  • kuroi neko
    kuroi neko about 10 years
    Thanks. I'm not here for the money anyway :). I could port the relevant bit to JavaScript to make a more suited response, but at any rate I think the JS port of Google's protocol buffers is not the right tool for the job. You would embark literally thousands of lines of JavaScript to swat a fly, and the resulting "compressed" string would actually be longer than your stripped down message payload.
  • jgb
    jgb about 10 years
    @kuroineko require.js is optional. Long.js, ByteBuffer.js and ProtoBuf.js are minified and gziped 20Kb (1 second download time over a 128 Kbps connection). This proposal is straightforward and backward compatible with future changes (old bookmarks).
  • kuroi neko
    kuroi neko about 10 years
    I'm not here to argue, but this compatibility depends on an external port of Google's libraries, which are maintained only for C++, Python and Ruby. In my opinion, the flaw in this solution is to embark everything client-side, while a server-side tool could do 90% of the work and pass only the minimal data to the clients (the parameters structure and the compressed string), reducing the client-side code to a small coder/decoder (and more efficient than base64 encode, or else the "compressed" string would actually be longer than the raw data!).
  • kuroi neko
    kuroi neko over 8 years
    Really curious about the downvote... Is this post violating a law?
  • Sven
    Sven over 8 years
    Here's your bounty. Not sure 200 is too much, but SO let's you only award in incremental steps, so no 100 additional bounty. On the other hand, your answer got significantly more upvotes than the other one, and I take it as a positive vote from the audience.
  • kuroi neko
    kuroi neko over 8 years
    Haha better late than never I suppose. I'm glad it helped you. Never took the time to produce the client code, but if you really want to use it I could have a look.
  • RichieRock
    RichieRock over 6 years
    I like this implementation! Tried it in our project and seems to be working nicely. Now just need to get it working with angular routing :)
  • rraallvv
    rraallvv almost 6 years
    Any JS implementation of this?
  • Sachin Singh
    Sachin Singh over 5 years
    Is it compatible with IE9?
  • Anantha Kumaran
    Anantha Kumaran over 5 years
    I haven't tested it explicitly with IE9, but it should work as the implementation doesn't depend on any browser-specific api.
  • Ryan
    Ryan almost 4 years
    jsfiddle is broken. Get a 404 for rawgit.com/dcodeIO/protobuf.js/master/dist/…