What is the best way to compress JSON for storage in a memory-based store like Redis or Memcached?


Solution 1

We just use gzip as a compressor.

import gzip
import cStringIO

def decompressStringToFile(value, outputFile):
  """
  decompress the given string value (which must be valid compressed gzip
  data) and write the result in the given open file.
  """
  stream = cStringIO.StringIO(value)
  decompressor = gzip.GzipFile(fileobj=stream, mode='r')
  while True:  # until EOF
    chunk = decompressor.read(8192)
    if not chunk:
      decompressor.close()
      outputFile.close()
      return 
    outputFile.write(chunk)

def compressFileToString(inputFile):
  """
  read the given open file, compress the data and return it as string.
  """
  stream = cStringIO.StringIO()
  compressor = gzip.GzipFile(fileobj=stream, mode='w')
  while True:  # until EOF
    chunk = inputFile.read(8192)
    if not chunk:  # EOF?
      compressor.close()
      return stream.getvalue()
    compressor.write(chunk)

In our use case we store the result as files, as you can imagine. To work purely with in-memory strings, you can use a cStringIO.StringIO() object as a replacement for the file as well.
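
For illustration, a minimal sketch of that in-memory variant (Python 2, reusing compressFileToString from above; the Redis client r and the key name are hypothetical, not part of this answer):

import json
import cStringIO

# Compact the JSON first, then gzip it entirely in memory using the
# compressFileToString helper defined above.
payload = json.dumps({"members": ["1", "2"]}, separators=(',', ':'))
blob = compressFileToString(cStringIO.StringIO(payload))
# r.set('mykey', blob)  # Redis string values are binary safe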

Solution 2

Based on @Alfe's answer above, here is a version that keeps the contents in memory (useful for network I/O tasks). I also made a few changes to support Python 3.

import gzip
from io import BytesIO

def decompressBytesToString(inputBytes):
  """
  decompress the given byte array (which must be valid 
  compressed gzip data) and return the decoded text (utf-8).
  """
  bio = BytesIO()
  stream = BytesIO(inputBytes)
  decompressor = gzip.GzipFile(fileobj=stream, mode='r')
  while True:  # until EOF
    chunk = decompressor.read(8192)
    if not chunk:
      decompressor.close()
      bio.seek(0)
      return bio.read().decode("utf-8")
    bio.write(chunk)

def compressStringToBytes(inputString):
  """
  read the given string, encode it in utf-8,
  compress the data and return it as a byte array.
  """
  bio = BytesIO()
  bio.write(inputString.encode("utf-8"))
  bio.seek(0)
  stream = BytesIO()
  compressor = gzip.GzipFile(fileobj=stream, mode='w')
  while True:  # until EOF
    chunk = bio.read(8192)
    if not chunk:  # EOF?
      compressor.close()
      return stream.getvalue()
    compressor.write(chunk)

To test the compression try:

inputString = "asdf" * 1000
len(inputString)
len(compressStringToBytes(inputString))
decompressBytesToString(compressStringToBytes(inputString))
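
As a rough illustration of the Redis use case (this assumes the third-party redis-py client and a locally running Redis; the key name is made up):

import redis  # pip install redis

r = redis.Redis()
# Store the gzip-compressed bytes under a key, then restore the original text.
r.set("mydoc", compressStringToBytes(inputString))
assert decompressBytesToString(r.get("mydoc")) == inputString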

Solution 3

I did some extensive comparisons between different binary formats (MessagePack, BSON, Ion, Smile, CBOR) and compression algorithms (Brotli, Gzip, XZ, Zstandard, bzip2).

For the JSON data I used for testing, keeping the data as JSON and using Brotli compression was the best solution. Brotli has different compression levels, so if you are persisting the data for a long period of time, then using a high level of compression can be worth it. If you are not persisting for very long, then a lower level of compression or using Zstandard might be most effective.
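
For example, a minimal sketch assuming the third-party brotli package (pip install brotli); the quality values shown are just illustrative choices:

import json
import brotli

doc = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}
raw = json.dumps(doc, separators=(',', ':')).encode("utf-8")

small = brotli.compress(raw, quality=11)  # highest quality: best ratio, slowest
fast = brotli.compress(raw, quality=4)    # lower quality: quicker, larger output
assert brotli.decompress(small) == raw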

Gzip is easy, but there are almost certainly alternatives that are quicker, compress better, or both.

You can read the full details of our investigation here: Blog Post

Solution 4

If you want it to be fast, try lz4. If you want it to compress better, go for lzma.
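
A quick sketch of both options; lzma is in the Python standard library, while lz4 needs the third-party lz4 package (its lz4.frame API is assumed here):

import json
import lzma
import lz4.frame

raw = json.dumps({"members": ["1", "2"]}, separators=(',', ':')).encode("utf-8")

dense = lzma.compress(raw)       # better ratio, slower
quick = lz4.frame.compress(raw)  # much faster, weaker ratio
assert lzma.decompress(dense) == raw
assert lz4.frame.decompress(quick) == raw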

Are there any other better ways to compress JSON to save memory in Redis (also ensuring lightweight decoding afterwards)?

How good a candidate would msgpack [http://msgpack.org/] be?

Msgpack is relatively fast and has a smaller memory footprint, but ujson is generally faster for me. You should compare them on your own data: measure the encoding and decoding speeds as well as the resulting size.
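
A rough comparison harness along those lines (msgpack and ujson are third-party packages; sizes and timings will depend on your data):

import timeit
import msgpack
import ujson

doc = {"details": {str(i): {"age": i, "name": "user%d" % i} for i in range(500)}}

packed = msgpack.packb(doc)
as_json = ujson.dumps(doc)
print(len(packed), len(as_json))  # encoded sizes
print(timeit.timeit(lambda: msgpack.packb(doc), number=1000))  # encode speed
print(timeit.timeit(lambda: ujson.dumps(doc), number=1000))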

Shall I consider options like pickle as well?

Consider both pickle (cPickle in particular) and marshal. They are fast. But remember that they are not secure or scalable, and you pay for the speed with the added responsibility.
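
A minimal sketch of those stdlib options (Python 3, where cPickle is simply pickle; only unpickle data you trust):

import marshal
import pickle

doc = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}

p = pickle.dumps(doc, protocol=pickle.HIGHEST_PROTOCOL)
m = marshal.dumps(doc)
assert pickle.loads(p) == doc
assert marshal.loads(m) == doc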

Solution 5

One easy "post-process" approach is to build a "short key name" map and run the generated JSON through it before storage, and again (reversed) before deserializing to an object. For example:

Before: {"details":{"1":{"age":13,"name":"dhruv"},"2":{"age":15,"name":"Matt"}},"members":["1","2"]}
Map: details:d, age:a, name:n, members:m
Result: {"d":{"1":{"a":13,"n":"dhruv"},"2":{"a":15,"n":"Matt"}},"m":["1","2"]}

Just walk the JSON and replace key->value on the way to the database, and value->key on the way back to the application.

You can also gzip for extra goodness (won't be a string after that though).
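
A sketch of that idea (the helper below is illustrative, not from the original answer):

KEY_MAP = {"details": "d", "age": "a", "name": "n", "members": "m"}
REVERSE_MAP = {v: k for k, v in KEY_MAP.items()}

def rename_keys(obj, mapping):
  # Recursively rename dict keys; values and non-dict nodes pass through.
  if isinstance(obj, dict):
    return {mapping.get(k, k): rename_keys(v, mapping) for k, v in obj.items()}
  if isinstance(obj, list):
    return [rename_keys(v, mapping) for v in obj]
  return obj

doc = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}
short = rename_keys(doc, KEY_MAP)            # store json.dumps(short)
assert rename_keys(short, REVERSE_MAP) == doc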



Comments

  • DhruvPathak, almost 2 years ago

    Requirement: Python objects with 2-3 levels of nesting containing basic datatypes like integers, strings, lists, and dicts (no dates etc.) need to be stored as JSON in Redis against a key. What are the best methods available for compressing the JSON string for a low memory footprint? The target objects are not very large, having 1000 small elements on average, or about 15000 characters when converted to JSON.

    e.g.

    >>> my_dict
    {'details': {'1': {'age': 13, 'name': 'dhruv'}, '2': {'age': 15, 'name': 'Matt'}}, 'members': ['1', '2']}
    >>> json.dumps(my_dict)
    '{"details": {"1": {"age": 13, "name": "dhruv"}, "2": {"age": 15, "name": "Matt"}}, "members": ["1", "2"]}'
    ### SOME BASIC COMPACTION ###
    >>> json.dumps(my_dict, separators=(',',':'))
    '{"details":{"1":{"age":13,"name":"dhruv"},"2":{"age":15,"name":"Matt"}},"members":["1","2"]}'
    

    1/ Are there any other better ways to compress JSON to save memory in Redis (also ensuring lightweight decoding afterwards)?

    2/ How good a candidate would msgpack [http://msgpack.org/] be?

    3/ Shall I consider options like pickle as well?

    • drekyn, about 11 years ago
      What are the requirements of your application? Do you need performance, reliability, consistency, etc.? Would you consider alternatives to Redis?
  • DhruvPathak, about 11 years ago
    I do not think BSON can be added as a value for a key in redis.
  • Jonatan Hedborg, about 11 years ago
    @DhruvPathak sure it can, why wouldn't it? Redis has no opinion on what you store in a key.
  • DhruvPathak, about 11 years ago
    @JonatanHedborg thanks for the correction. I did not pay attention to the point that redis strings are binary safe.
  • Jonatan Hedborg, about 11 years ago
    However, BSON isn't really more compact than JSON (as stated on their site), so it's not really an option.
  • Tian, almost 4 years ago
    Is it better to use with gzip.GzipFile(fileobj=stream, mode='w') as compressor:? As with the usual Python open function, that would ensure the file is closed properly if the loop stops early.
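
    For illustration, a minimal sketch of that with-based variant (one possible rewrite of compressStringToBytes from Solution 2, not code from the answers above):

    import gzip
    from io import BytesIO

    def compressStringToBytes(inputString):
      stream = BytesIO()
      # The with-block guarantees the GzipFile is closed (and the gzip
      # trailer flushed) even if an exception interrupts the work.
      with gzip.GzipFile(fileobj=stream, mode='w') as compressor:
        compressor.write(inputString.encode("utf-8"))
      return stream.getvalue()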