python byte string encode and decode

Solution 1

You need to examine the documentation for the software API that you are using. BLOB is an acronym: Binary Large OBject.

If your data is in fact binary, the idea of decoding it to Unicode is of course a nonsense.

If it is in fact text, you need to know what encoding to use to decode it to Unicode.

Then you use json.dumps(a_Python_object) ... if you encode it to UTF-8 yourself first, json will just decode it back again:

>>> import json
>>> json.dumps(u"\u0100\u0404")
'"\\u0100\\u0404"'
>>> json.dumps(u"\u0100\u0404".encode('utf8'))
'"\\u0100\\u0404"'
>>>

UPDATE about latin1:

u'\x80' is a useless meaningless C1 control character -- the encoding is extremely unlikely to be Latin-1. Latin-1 is "a snare and a delusion" -- all 8-bit bytes are decoded to Unicode without raising an exception. Don't confuse "works" and "doesn't raise an exception".
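
To see why "doesn't raise an exception" is not evidence of anything, here is a minimal sketch (Python 2; the bytes are arbitrary): every possible 8-bit byte decodes under latin1 without error, so a "successful" decode tells you nothing about the real encoding.

>>> all_bytes = ''.join(chr(i) for i in range(256))
>>> u = all_bytes.decode('latin1')  # never raises, whatever the bytes were
>>> u[0x80]  # "decoded" to a meaningless C1 control character, not recovered text
u'\x80'
>>>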

Solution 2

Use b.decode('name of source encoding') to get a unicode version. This was surprising to me when I learned it. E.g.:

In [123]: 'foo'.decode('latin-1')
Out[123]: u'foo'
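
If the end goal is JSON, the decoded unicode object can then go straight to json.dumps (a sketch that assumes latin-1 really is the source encoding, which only you can verify):

In [124]: import json

In [125]: json.dumps('\x80'.decode('latin-1'))
Out[125]: '"\\u0080"'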

Solution 3

I think what you are trying to do is decode a string object that is in some encoding (do you know what that encoding is?) to get a unicode object:

unicode_b = b.decode('some_encoding')

and then re-encode the unicode object using the utf_8 codec to get a string object back:

b = unicode_b.encode('utf_8')

This uses the unicode object as a translator. Without knowing the original encoding of the string I can't say for certain, and there is a real possibility that the conversion will not go as expected; the unicode object is not meant for converting strings of one encoding to another by guesswork. Work with the unicode object if you know what the encoding is (if you don't, there is no reliable way to find out short of trial and error), and convert back to an encoded string when you want a string object again.
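
Putting the two steps together (a sketch only; 'latin-1' is a placeholder for whatever the real source encoding turns out to be):

>>> b = '\x80abc'
>>> unicode_b = b.decode('latin-1')   # bytes -> unicode, via the known source encoding
>>> unicode_b
u'\x80abc'
>>> unicode_b.encode('utf_8')         # unicode -> UTF-8 encoded bytes
'\xc2\x80abc'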

Comments

  • kung-foo
    kung-foo almost 2 years

    I am trying to convert an incoming byte string that contains non-ascii characters into a valid utf-8 string so that I can dump it as JSON.

    b = '\x80'
    u8 = b.encode('utf-8')
    j = json.dumps(u8)
    

    I expected j to be '\xc2\x80' but instead I get:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
    

    In my situation, 'b' is coming from mysql via google protocol buffers and is filled out with some blob data.

    Any ideas?

    EDIT: I have ethernet frames that are stored in a mysql table as a blob (please, everyone, stay on topic and keep from discussing why there are packets in a table). The table collation is utf-8 and the db layer (sqlalchemy, non-orm) is grabbing the data and creating structs (google protocol buffers) which store the blob as a python 'str'. In some cases I use the protocol buffers directly without any issue. In other cases, I need to expose the same data via json. What I noticed is that when json.dumps() does its thing, '\x80' can be replaced with the Unicode replacement character (\ufffd, IIRC).

  • Daniel Roseman
    Daniel Roseman about 12 years
    Remember: decode goes from bytes to unicode. encode goes from unicode to bytes.
  • Marcin
    Marcin about 12 years
    @DanielRoseman Yes, that is why this is an answer to the question.
  • Daniel Roseman
    Daniel Roseman about 12 years
    Sure, I wasn't arguing, just providing some extra explanation for the OP.
  • kung-foo
    kung-foo about 12 years
    I had come up with the same thing but it seemed so inefficient. I am surprised there isn't a way to directly utf-8 encode a byte string.
  • Marcin
    Marcin about 12 years
    @kung-foo What do you mean? In what way is this not "direct"?
  • John Machin
    John Machin about 12 years
    -1 (a) You cannot infer an encoding from one byte! (b) u'\x80' is a useless meaningless C1 control character -- the encoding is extremely unlikely to be Latin-1. (c) encoding into UTF-8 is pointless -- see my answer.
  • kung-foo
    kung-foo about 12 years
    interesting. i guess i can keep it simple: print json.dumps('\x80'.decode('latin1'))
  • John Machin
    John Machin about 12 years
    @kung-foo: You have no evidence that latin1 is the correct encoding.
  • kung-foo
    kung-foo about 12 years
    Hence my original question. I have a 'byte' string. No encoding. But encoding into utf-8 gives me flexibility to pass the data around.
  • kung-foo
    kung-foo about 12 years
    Then what is the method for encoding a string of bytes into utf-8?
  • Marcin
    Marcin about 12 years
    @JohnMachin Where do I suggest that one can "infer an encoding from one byte", or suggest that the encoding is latin-1? In any case, given that latin-1 matches UTF-8 for code points up to 255, it's a fairly safe choice if there is no real encoding.
  • Marcin
    Marcin about 12 years
    @kung-foo You are seriously confused - unicode strings are for representing strings of characters (the abstract correspondents of unicode codepoints); byte strings are for representing bytes. If this is not textual data, keep it as a byte string. If it is data which is already utf-8 encoded, decode it as utf-8.
  • Marcin
    Marcin about 12 years
    Where do BLOBs come into this?
  • John Machin
    John Machin about 12 years
    @kung-foo: The method is a_string_of_bytes.decode('some_encoding').encode('utf8') ... however you don't need to encode into utf8 to be able to use json.dumps; you DO need to establish if your data is text and if so, what some_encoding is, and latin1 is unlikely ... ummm in other words: just re-read my answer slowly and carefully
  • Marcin
    Marcin about 12 years
    @kung-foo: Although, I guess in this case, if you are transporting as JSON you have no choice but to stuff this into unicode.
  • John Machin
    John Machin about 12 years
    @Marcin: quoting the OP: "is filled out with some blob data"
  • Marcin
    Marcin about 12 years
    So, what is your solution for getting this binary data into a JSON string, if not stuffing it into a unicode object first?
  • John Machin
    John Machin about 12 years
    @Marcin: Marcin: in the current context, latin1 is a ludicrous choice for an example of how to decode. "latin-1 matches UTF-8 for code points up to 255" is meaningless -- unichr(255).encode('utf8') produces \xc3\xbf but unichr(255).encode('latin1') produces '\xff'. If there is no real encoding, then the data cannot be meaningfully decoded.
  • John Machin
    John Machin about 12 years
    @Marcin: The OP has to read the docs for the API he's using for input, and the docs for whatever is going to consume the JSON string. Making suggestions without a full problem statement is NOT helpful.
  • Marcin
    Marcin about 12 years
    @JohnMachin It sounds like you're refusing to engage with the fact that the json encoder will only work with unicode.
  • John Machin
    John Machin about 12 years
    Would the anonymous downvoter care to explain which part of my answer is thought to be incorrect?
  • John Machin
    John Machin about 12 years
    @Marcin: "refusing to engage": nonsense. E.g. he consumer may require a base-64-encoded string, in which case one would need to base-64-encode the original data, then use json.dumps(b64_encoded_data.decode('ascii')) -- speculation is pointless.
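
A minimal sketch of the base-64 route John Machin describes in the comment above (Python 2; frame stands in for the blob bytes, and it assumes the JSON consumer agrees to base-64-decode the field):

>>> import base64, json
>>> frame = '\x80\x00\xff'  # arbitrary binary blob data
>>> j = json.dumps(base64.b64encode(frame).decode('ascii'))  # base-64 output is pure ASCII
>>> j
'"gAD/"'
>>> base64.b64decode(json.loads(j).encode('ascii'))  # the consumer reverses both steps
'\x80\x00\xff'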