Let JSON object accept bytes or let urlopen output strings

python json python-3.x encoding urlopen

156,376

Solution 1

HTTP sends bytes. If the resource in question is text, the character encoding is normally specified, either by the Content-Type HTTP header or by another mechanism (an RFC, HTML meta http-equiv,...).

urllib should know how to decode the bytes to a string, but it's too naïve; it's a horribly underpowered and un-Pythonic library.

Dive Into Python 3 provides an overview of the situation.

Your "work-around" is fine—although it feels wrong, it's the correct way to do it.

Solution 2

Python’s wonderful standard library to the rescue…

import codecs
import json

# response is the binary file-like object returned by urlopen()
reader = codecs.getreader("utf-8")
obj = json.load(reader(response))

Works with both py2 and py3.

Docs: Python 2, Python 3
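As a rough, self-contained sketch of the same idea, the snippet below swaps the real HTTP response for an `io.BytesIO` object so it can run without a network call. The payload is made up for illustration, and the stand-in is an assumption: any binary file-like object with a `read()` method should behave the same way.

```python
import codecs
import io
import json

# Stand-in for the binary file-like object returned by urlopen()
# (hypothetical payload, used so the example runs offline).
response = io.BytesIO('{"name": "Zoë", "id": 7}'.encode("utf-8"))

# codecs.getreader() returns a StreamReader class; calling it wraps the
# binary stream so that read() yields str instead of bytes.
reader = codecs.getreader("utf-8")
obj = json.load(reader(response))

print(obj)
```

The reader decodes while reading from the stream, so the intermediate `read().decode()` step does not have to be written out by hand.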

Solution 3

I have come to the opinion that the question is the best answer :)

import json
from urllib.request import urlopen

response = urlopen("http://site.com/api/foo/bar").read().decode('utf8')
obj = json.loads(response)

Solution 4

For anyone else trying to solve this using the requests library:

import json
import requests

r = requests.get('http://localhost/index.json')
r.raise_for_status()
# works for Python2 and Python3
json.loads(r.content.decode('utf-8'))

Solution 5

This one works for me. I used the 'requests' library with its json() method; check out the docs for Requests: HTTP for Humans.

import requests

url = 'here goes your url'

obj = requests.get(url).json()

Peter Smit

Currently working as a Doctoral Student in the Speech Group of the Department of Signal Processing and Acoustics of the Aalto University School of Electrical Engineering (formerly TKK / Helsinki University of Technology) in Helsinki, Finland.

Updated on September 26, 2020

Comments

Peter Smit over 3 years
With Python 3 I am requesting a json document from a URL.
```
response = urllib.request.urlopen(request)
```
The response object is a file-like object with read and readline methods. Normally a JSON object can be created with a file opened in text mode.
```
obj = json.load(fp)
```
What I would like to do is:
```
obj = json.load(response)
```
This however does not work as urlopen returns a file object in binary mode.

A work around is of course:
```
str_response = response.read().decode('utf-8')
obj = json.loads(str_response)
```
but this feels bad...

Is there a better way that I can transform a bytes file object to a string file object? Or am I missing any parameters for either urlopen or json.load to give an encoding?
- Bob Yoplait about 7 years
  
  I think you have a typo there, "readall" should be "read" ?
- CaptainNemo over 6 years
  
  @BobYoplait I agree.
ThatAintWorking about 10 years

This may be the "correct" way to do it but if there was one thing I could undo about Python 3 it would be this bytes/strings crap. You would think the built-in library functions would at least know how to deal with other built-in library functions. Part of the reason we use python is the simple intuitive syntax. This change breaks that all over the place.
offby1 over 9 years

Check out the "requests" library -- it handles this sort of thing for you automagically.
jbg over 9 years

This isn’t a case of the built-in library functions needing to “know how” to deal with other functions. JSON is defined as a UTF-8 representation of objects, so it can’t magically decode bytes that it doesn’t know the encoding of. I do agree that urlopen ought to be able to decode the bytes itself since it knows the encoding. Anyway, I’ve posted the Python standard library solution as an answer — you can do streaming decoding of bytes using the codecs module.
Aaron Lelevier almost 9 years

I got this error when trying this answer in Python 3.4.3, and I'm not sure why. The error was TypeError: the JSON object must be str, not 'StreamReader'
sleepycal over 8 years

@AronYsidoro Did you possibly use json.loads() instead of json.load()?
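A small sketch of the difference, assuming a made-up payload and an `io.BytesIO` stand-in for the response: `json.load()` expects a file-like object, while `json.loads()` expects the document itself, so passing it a stream reader raises `TypeError`.

```python
import codecs
import io
import json

payload = b'{"ok": true}'  # hypothetical response body

# json.load() takes a file-like object and calls its read() method.
stream = codecs.getreader("utf-8")(io.BytesIO(payload))
from_file = json.load(stream)

# json.loads() takes the document itself; a reader object is rejected.
stream = codecs.getreader("utf-8")(io.BytesIO(payload))
try:
    json.loads(stream)
    loads_error = None
except TypeError as exc:
    loads_error = exc

print(from_file)
print(type(loads_error).__name__)
```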
Phil Frost about 8 years

For bonus points, use the encoding specified in the response, instead of assuming utf-8: response.headers.get_content_charset(). Returns None if there is no encoding, and doesn't exist on python2.
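One way this suggestion might look, as a sketch only: `decode_json_body` is a hypothetical helper name, and `email.message.Message` stands in for `response.headers` (on Python 3, `http.client.HTTPMessage` is a subclass of it) so the example runs offline. The fallback to UTF-8 is an assumption for the case where no charset is declared.

```python
import json
from email.message import Message


def decode_json_body(raw, headers, default="utf-8"):
    """Decode raw bytes with the declared charset, else the default."""
    # get_content_charset() returns None when no charset parameter is present.
    charset = headers.get_content_charset() or default
    return json.loads(raw.decode(charset))


# Stand-in for response.headers with an explicit charset.
headers = Message()
headers["Content-Type"] = "application/json; charset=utf-8"
with_charset = decode_json_body('{"city": "Zürich"}'.encode("utf-8"), headers)

# Stand-in for a response that declares no charset at all.
bare = Message()
bare["Content-Type"] = "application/json"
without_charset = decode_json_body(b'{"n": 1}', bare)

print(with_charset, without_charset)
```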
jbg about 8 years

@PhilFrost That’s slick. In practice it might pay to be careful with that; JSON is always UTF-8, UTF-16 or UTF-32 by definition (and is overwhelmingly likely to be UTF-8), so if another encoding is returned by the web server, it’s possibly a misconfiguration of the web server software rather than genuinely non-standard JSON.
jfs about 8 years

@jbg: json itself is a text format; it knows nothing about character encodings and bytes. Nothing stops you storing it on disk using any character encoding you like. Though RFCs for the application/json media type say: "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32." i.e., a web server must use only these encodings. Also, there is no charset parameter defined for application/json and the recent RFC specifies no way to detect the encoding. It makes utf-8 the only choice.
jfs about 8 years

@PhilFrost it exists on Python 2 as response.headers.getparam('charset'), see A good way to get the charset/encoding of an HTTP response in Python. Though as I said in the previous comment: It doesn't help with json.
Harper Koo over 7 years

When I used it in Python 3.5, the error was "AttributeError: 'bytes' object has no attribute 'read'"
jbg over 7 years

@harperkoo: Did you possibly pass a bytes object as the response variable instead of a file-like object? If you already have a bytes object and just want to decode it, you can simply call the decode(encoding) method on it.
jbg over 7 years

This functionality is built into requests: you can simply do r.json()
sfblackl about 7 years

I got to this page because I was having an issue with Flask unit tests - thanks for posting the single line call.
Blairg23 almost 7 years

To clarify, if you use @jbg's method, you don't need to do json.loads. All you have to do is r.json() and you've got your JSON object loaded into a dict already.
EvertW almost 7 years

@ThatAintWorking: I would disagree. While it is a pain in the neck to explicitly have to manage the difference between bytes and strings, it is a much greater pain to have the language make some implicit conversion for you. Implicit bytes <-> string conversions are a source of many bugs, and Python3 is very helpful in pointing out the pitfalls. But I agree the library has room for improvement in this area.
ThatAintWorking almost 7 years

@EvertW the failure, in my opinion, is forcing strings to be unicode in the first place.
EvertW almost 7 years

@ThatAintWorking: No, strings must be Unicode, if you want software that can be used in other places than the UK or USA. For decades we have suffered under the myopic worldview of the ASCII committee. Python3 finally got it right. Might have something to do with Python originating in Europe...
andilabs about 6 years

*** AttributeError: 'Response' object has no attribute 'readable'
andilabs about 6 years

*** AttributeError: 'bytes' object has no attribute 'readable'
andilabs about 6 years

*** UnicodeEncodeError: 'ascii' codec can't encode characters in position 264-265: ordinal not in range(128)
Collin Anderson about 6 years

Are you using urllib or requests? This is for urllib. If you have a bytes object, just use json.loads(bytes_obj.decode()).
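A minimal sketch of that suggestion, with a made-up payload. In Python 3, `bytes.decode()` defaults to UTF-8. From Python 3.6 onward, `json.loads()` also accepts `bytes` directly, so the explicit decode is optional there; the second call assumes such an interpreter.

```python
import json

bytes_obj = '{"price": "5 €"}'.encode("utf-8")  # hypothetical response body

# Decode explicitly (UTF-8 is the default for bytes.decode()), then parse.
obj = json.loads(bytes_obj.decode())

# Assumes Python 3.6+: json.loads() accepts bytes as well as str.
obj_direct = json.loads(bytes_obj)

print(obj)
```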
BMDan almost 5 years

@jfs @jbg @phil-frost RFC8259 says, "Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients." Whether it is therefore better to trust, to ignore, or to trust-but-heuristically-evaluate-and-then-work-around a charset that a server nonetheless elected to send is likely a problem of the deepest sort of bikeshedding variety.
jfs almost 5 years

@BMDan follow the link in my comment above that literally says: "no charset parameter defined..."
Baldrickk over 4 years

This is the best way. Really readable, and anyone who is doing something like this should have requests.