python byte string encode and decode

Solution 1

You need to examine the documentation for the software API that you are using. BLOB is an acronym: Binary Large OBject.

If your data is in fact binary, the idea of decoding it to Unicode is of course a nonsense.

If it is in fact text, you need to know what encoding to use to decode it to Unicode.

Then you use json.dumps(a_Python_object) ... if you encode it to UTF-8 yourself first, json will just decode it back again:

>>> import json
>>> json.dumps(u"\u0100\u0404")
'"\\u0100\\u0404"'
>>> json.dumps(u"\u0100\u0404".encode('utf8'))
'"\\u0100\\u0404"'
>>>

UPDATE about latin1:

u'\x80' is a useless meaningless C1 control character -- the encoding is extremely unlikely to be Latin-1. Latin-1 is "a snare and a delusion" -- all 8-bit bytes are decoded to Unicode without raising an exception. Don't confuse "works" and "doesn't raise an exception".
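
To see why "doesn't raise an exception" is not evidence of anything, here is a minimal sketch (Python 2; the bytes are arbitrary): every possible 8-bit byte decodes under latin1 without error, so a "successful" decode tells you nothing about the real encoding.

>>> all_bytes = ''.join(chr(i) for i in range(256))
>>> u = all_bytes.decode('latin1')  # never raises, whatever the bytes were
>>> u[0x80]  # "decoded" to a meaningless C1 control character, not recovered text
u'\x80'
>>>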

Solution 2

Use b.decode('name of source encoding') to get a unicode version. This was surprising to me when I learned it. E.g.:

In [123]: 'foo'.decode('latin-1')
Out[123]: u'foo'
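
If the end goal is JSON, the decoded unicode object can then go straight to json.dumps (a sketch that assumes latin-1 really is the source encoding, which only you can verify):

In [124]: import json

In [125]: json.dumps('\x80'.decode('latin-1'))
Out[125]: '"\\u0080"'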

Solution 3

I think what you are trying to do is decode a string object that is in some encoding (do you know what that encoding is?) to get a unicode object:

unicode_b = b.decode('some_encoding')

and then re-encode the unicode object using the utf_8 codec to get a string object back:

b = unicode_b.encode('utf_8')

This uses the unicode object as a translator. Without knowing the original encoding of the string I can't say for certain, and there is a real possibility that the conversion will not go as expected; the unicode object is not meant for converting strings of one encoding to another by guesswork. Work with the unicode object if you know what the encoding is (if you don't, there is no reliable way to find out short of trial and error), and convert back to an encoded string when you want a string object again.
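
Putting the two steps together (a sketch only; 'latin-1' is a placeholder for whatever the real source encoding turns out to be):

>>> b = '\x80abc'
>>> unicode_b = b.decode('latin-1')   # bytes -> unicode, via the known source encoding
>>> unicode_b
u'\x80abc'
>>> unicode_b.encode('utf_8')         # unicode -> UTF-8 encoded bytes
'\xc2\x80abc'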

Comments

  • kung-foo
    kung-foo almost 2 years

    I am trying to convert an incoming byte string that contains non-ascii characters into a valid utf-8 string so that I can dump it as JSON.

    b = '\x80'
    u8 = b.encode('utf-8')
    j = json.dumps(u8)
    

    I expected j to be '\xc2\x80' but instead I get:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
    

    In my situation, 'b' is coming from mysql via google protocol buffers and is filled out with some blob data.

    Any ideas?

    EDIT: I have ethernet frames that are stored in a mysql table as a blob (please, everyone, stay on topic and keep from discussing why there are packets in a table). The table collation is utf-8 and the db layer (sqlalchemy, non-orm) is grabbing the data and creating structs (google protocol buffers) which store the blob as a python 'str'. In some cases I use the protocol buffers directly without any issue. In other cases, I need to expose the same data via json. What I noticed is that when json.dumps() does its thing, '\x80' can be replaced with the Unicode replacement character (\ufffd, IIRC).

  • Daniel Roseman
    Daniel Roseman about 12 years
    Remember: decode goes from bytes to unicode. encode goes from unicode to bytes.
  • Marcin
    Marcin about 12 years
    @DanielRoseman Yes, that is why this is an answer to the question.
  • Daniel Roseman
    Daniel Roseman about 12 years
    Sure, I wasn't arguing, just providing some extra explanation for the OP.
  • kung-foo
    kung-foo about 12 years
    I had come up with the same thing but it seemed so inefficient. I am surprised there isn't a way to directly utf-8 encode a byte string.
  • Marcin
    Marcin about 12 years
    @kung-foo What do you mean? In what way is this not "direct"?
  • John Machin
    John Machin about 12 years
    -1 (a) You cannot infer an encoding from one byte! (b) u'\x80' is a useless meaningless C1 control character -- the encoding is extremely unlikely to be Latin-1. (c) encoding into UTF-8 is pointless -- see my answer.
  • kung-foo
    kung-foo about 12 years
    interesting. i guess i can keep it simple: print json.dumps('\x80'.decode('latin1'))
  • John Machin
    John Machin about 12 years
    @kung-foo: You have no evidence that latin1 is the correct encoding.
  • kung-foo
    kung-foo about 12 years
    Hence my original question. I have a 'byte' string. No encoding. But encoding into utf-8 gives me flexibility to pass the data around.
  • kung-foo
    kung-foo about 12 years
    Then what is the method for encoding a string of bytes into utf-8?
  • Marcin
    Marcin about 12 years
    @JohnMachin Where do I suggest that one can "infer an encoding from one byte", or suggest that the encoding is latin-1? In any case, given that latin-1 matches UTF-8 for code points up to 255, it's a fairly safe choice if there is no real encoding.
  • Marcin
    Marcin about 12 years
    @kung-foo You are seriously confused - unicode strings are for representing strings of characters (the abstract correspondents of unicode codepoints); byte strings are for representing bytes. If this is not textual data, keep it as a byte string. If it is data which is already utf-8 encoded, decode it as utf-8.
  • Marcin
    Marcin about 12 years
    Where do BLOBs come into this?
  • John Machin
    John Machin about 12 years
    @kung-foo: The method is a_string_of_bytes.decode('some_encoding').encode('utf8') ... however you don't need to encode into utf8 to be able to use json.dumps; you DO need to establish if your data is text and if so, what some_encoding is, and latin1 is unlikely ... ummm in other words: just re-read my answer slowly and carefully
  • Marcin
    Marcin about 12 years
    @kung-foo: Although, I guess in this case, if you are transporting as JSON you have no choice but to stuff this into unicode.
  • John Machin
    John Machin about 12 years
    @Marcin: quoting the OP: "is filled out with some blob data"
  • Marcin
    Marcin about 12 years
    So, what is your solution for getting this binary data into a JSON string, if not stuffing it into a unicode object first?
  • John Machin
    John Machin about 12 years
    @Marcin: Marcin: in the current context, latin1 is a ludicrous choice for an example of how to decode. "latin-1 matches UTF-8 for code points up to 255" is meaningless -- unichr(255).encode('utf8') produces \xc3\xbf but unichr(255).encode('latin1') produces '\xff'. If there is no real encoding, then the data cannot be meaningfully decoded.
  • John Machin
    John Machin about 12 years
    @Marcin: The OP has to read the docs for the API he's using for input, and the docs for whatever is going to consume the JSON string. Making suggestions without a full problem statement is NOT helpful.
  • Marcin
    Marcin about 12 years
    @JohnMachin It sounds like you're refusing to engage with the fact that the json encoder will only work with unicode.
  • John Machin
    John Machin about 12 years
    Would the anonymous downvoter care to explain which part of my answer is thought to be incorrect?
  • John Machin
    John Machin about 12 years
    @Marcin: "refusing to engage": nonsense. E.g. he consumer may require a base-64-encoded string, in which case one would need to base-64-encode the original data, then use json.dumps(b64_encoded_data.decode('ascii')) -- speculation is pointless.
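
A minimal sketch of the base-64 route John Machin describes in the comment above (Python 2; frame stands in for the blob bytes, and it assumes the JSON consumer agrees to base-64-decode the field):

>>> import base64, json
>>> frame = '\x80\x00\xff'  # arbitrary binary blob data
>>> j = json.dumps(base64.b64encode(frame).decode('ascii'))  # base-64 output is pure ASCII
>>> j
'"gAD/"'
>>> base64.b64decode(json.loads(j).encode('ascii'))  # the consumer reverses both steps
'\x80\x00\xff'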