What is the right way to compress and decompress UTF-8 data using zlib?

16,229

Solution 1

Your JSON data is not UTF-8 encoded. The encoding parameter to the json.dumps() function tells it how to interpret Python byte strings in message (i.e. the input), not how to encode the resulting output. The output is not encoded at all because you used ensure_ascii=False, so you get a unicode string back.

Encode the data before compression:

ssc = zlib.compress(ss.encode('utf8'))

When decompressing again, there is no need to decode from UTF-8; the json.loads() function assumes UTF-8 if the input is a bytestring.
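
As a minimal sketch of the full round trip, written for Python 3 (where text and bytes are separate types) rather than the Python 2 used in the question; the sample message dict is made up for illustration:

```python
import json
import zlib

# Hypothetical sample message containing non-ASCII characters.
message = {"name": "Señor Muñoz", "city": "Kraków"}

# ensure_ascii=False keeps the non-ASCII characters as-is in the text.
ss = json.dumps(message, ensure_ascii=False, sort_keys=True)

# zlib works on bytes, so encode the text to UTF-8 before compressing.
ssc = zlib.compress(ss.encode("utf8"))

# Decompression returns the original UTF-8 bytes.
raw = zlib.decompress(ssc)

# Decoding explicitly works on every Python 3 version; json.loads() also
# accepts UTF-8 bytes directly on Python 3.6 and later.
restored = json.loads(raw.decode("utf8"))
```

If the data has to be split into fixed-size pieces, slicing the compressed bytes rather than the text avoids cutting a multi-byte UTF-8 character in half.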

Solution 2

A small addition to Martijn's response. I read on an Enthought blog about a nifty one-liner that spares you the need to import zlib in your own code.

Safely compressing a string (including your JSON dump) would look like this:

ssc = ss.encode('utf-8').encode('zlib_codec')

Decompressing back to utf-8 would be:

ss = ssc.decode('zlib_codec').decode('utf-8')
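
The chained form above relies on Python 2 string methods. On Python 3, str.encode() and bytes.decode() only accept text encodings, so bytes-to-bytes codecs such as zlib_codec have to go through codecs.encode() and codecs.decode(). A rough Python 3 equivalent, with a made-up sample string, might look like this:

```python
import codecs

# Hypothetical sample text with non-ASCII characters.
ss = "El niño está aquí"

# Text -> UTF-8 bytes -> zlib-compressed bytes.
ssc = codecs.encode(ss.encode("utf-8"), "zlib_codec")

# Compressed bytes -> UTF-8 bytes -> text.
restored = codecs.decode(ssc, "zlib_codec").decode("utf-8")
```

Note that this version still needs an import (codecs), so on Python 3 it saves nothing over calling zlib directly.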

Hope this helps.

Author by

I Z

Updated on July 05, 2022

Comments

  • I Z
    I Z almost 2 years

    I have a very long JSON message that contains characters that go beyond the ASCII table. I convert it into a string as follows:

    messStr = json.dumps(message,encoding='utf-8', ensure_ascii=False, sort_keys=True)
    

    I need to store this string using a service that restricts its size to X bytes. I want to split the JSON string into pieces of length X and store them separately. I ran into some issues doing this (described here) so I want to compress the string slices to work around those issues. I tried to do this:

    ss = messStr[start:fin]    # get piece of length X
    ssc = zlib.compress(ss) # compress it
    

    When I do that, I get the following error from zlib.compress:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 225: ordinal not in range(128)
    

    What is the right way to compress a UTF-8 string and what is then the right way to decompress it?

  • Anshu Dwibhashi
    Anshu Dwibhashi about 9 years
    This is what worked for me, rather than the other answer. Thanks for the epic solution! +1
  • Slawomir
    Slawomir over 5 years
    The above only works in Python 3.x, since the zlib package (finally) takes a byte array as input, not a string. In Python 2.7 this won't work, because zlib.compress takes a string and uses the ascii codec to turn the input into a byte array, hence the OP's error message.
  • Martijn Pieters
    Martijn Pieters over 5 years
    @Debriter yes, the problem in the question is unique to Python 2.
  • Lynx-Lab
    Lynx-Lab about 5 years
    @nurettin this code worked on Python 2, as it was when the question was asked. From your error message, it seems like you are using Python 3.
  • Jason R Stevens CFA
    Jason R Stevens CFA about 3 years
    I like this answer for avoiding the separate zlib import. I do suspect this penalizes code readability, as the direct use of the zlib module is front and center, whereas zlib_codec in the above is merely part of a chain. Thanks for the great answer!