Handling bad UTF-8 from JSON in Ruby
Solution 1
It looks like the JSON response body you are receiving is labeled US-ASCII instead of UTF-8, because Net::HTTP purposely doesn't force an encoding.
1.9.3p194 :044 > puts res.body.encoding
US-ASCII
In Ruby 1.9.3, you can force the encoding if you know what it's supposed to be. Try this:
response = res.body.force_encoding('UTF-8')
The JSON parser should then handle the UTF-8 the way you want it to.
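As a rough, offline sketch of that step (no network call): the sample body and the variable names below are made up for illustration, and the US-ASCII label is applied by hand to imitate what res.body reported above.

```ruby
require 'json'

# Hypothetical sample body: UTF-8 bytes that carry a US-ASCII label,
# imitating what res.body reported in the session above.
raw_body = "{\"title\":\"don\u2019t ding\"}".b.force_encoding('US-ASCII')

puts raw_body.encoding          # the label, not the actual bytes
puts raw_body.valid_encoding?   # false: the bytes are not plain ASCII

# Relabel the same bytes as UTF-8; nothing is transcoded.
response = raw_body.force_encoding('UTF-8')
puts response.valid_encoding?   # true once the label matches the bytes

parsed = JSON.parse(response)
puts parsed['title']
```

force_encoding only changes the label on the string. If the bytes were not valid UTF-8 to begin with, valid_encoding? would still return false afterwards.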
Solution 2
Using force_encoding seems like the best solution.
Following up on Kevin Dickerson's answer, here's an explanation of the weirdness.
Net::HTTP is sort of a mess.
On 1.9.3:
- If the server sends a chunked response, you'd always get ASCII-8BIT. This seems to take precedence over the other scenarios.
- If you call http.request with a Get object, you'd get US-ASCII. This method does not do compression for you.
- If you call http.get, compression is enabled.
  - If the server supports compression, you'd get ASCII-8BIT.
  - If the server doesn't send a compressed body, you'd get US-ASCII.
You'd get US-ASCII because when Net::HTTP creates the buffer string to receive the response, it's created in the interpreter's default source file encoding, which is US-ASCII. (The net/ source files don't have the magic encoding comment at the top, so they use Ruby's default.)
The decompression produces ASCII-8BIT because it's hardcoded to do that in the get method when decompressing.
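Since the label can come back as either US-ASCII or ASCII-8BIT depending on the code path, one option is to normalize the body in a single place. The sketch below is only an illustration: utf8_body is a hypothetical helper (not part of Net::HTTP), the payload is made up, and it assumes the server really sent UTF-8 bytes and that Ruby 2.1 or later is available for String#scrub.

```ruby
# Hypothetical helper, not part of Net::HTTP: relabel a response body as
# UTF-8 regardless of the label it arrived with, and replace any bytes
# that are not valid UTF-8 with '?'.
def utf8_body(body)
  text = body.dup.force_encoding(Encoding::UTF_8)
  text.valid_encoding? ? text : text.scrub('?')
end

payload = "{\"title\":\"caf\u00e9\"}"

# Simulate the two labels described above without touching the network.
binary_labeled = payload.b                                    # ASCII-8BIT
ascii_labeled  = payload.b.force_encoding(Encoding::US_ASCII) # US-ASCII

[binary_labeled, ascii_labeled].each do |body|
  puts "#{body.encoding} -> #{utf8_body(body).encoding}"
end
```

The helper works on a copy, so the original body keeps whatever label Net::HTTP gave it.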
On 2.0, it seems like you always get UTF-8 back, but this is because that's the default source-file encoding. If you change it via the -K option, the response encoding would change accordingly. Try passing n, e, s, or u to -K.
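A small way to observe the link between source-file encoding and freshly created strings, offered as a sketch: __ENCODING__ reports the encoding of the current source file, and a plain string literal is created in that same encoding. Running the script under different -K values and comparing the output is left as an experiment; the exact results may vary by Ruby version.

```ruby
# __ENCODING__ is the encoding of the current source file.
source_encoding = __ENCODING__

# A plain string literal is created in that same encoding.
literal_encoding = ''.encoding

puts "source file encoding:   #{source_encoding}"
puts "empty literal encoding: #{literal_encoding}"
```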
hodgesmr
Updated on June 26, 2022
I'm pulling data from remote json at http://hndroidapi.appspot.com/news/format/json/page/?appid=test . The problem I'm running into is that this API appears to be building the JSON without correctly handling UTF-8 encoding (correct me if I'm wrong here). For example, part of the result that gets passed right now is
{ "title":"IPad - please don€™t ding while you and I are asleep ", "url":"http://modern-products.tumblr.com/post/25384729998/ipad-please-dont-ding-while-you-and-i-are-asleep", "score":"10 points", "user":"roee", "comments":"18 comments", "time":"1 hour ago", "item_id":"4128497", "description":"10 points by roee 1 hour ago | 18 comments" }
Notice the don€™t. And that isn't the only type of character it is choking on. Is there anything I can do to convert the data into something clean, given that I don't control the API?
Edit:
Here is how I'm pulling down the JSON:
hn_url = "http://hndroidapi.appspot.com/news/format/json/page/?appid=test"
url = URI.parse(hn_url)

# Attempt to get the json
req = Net::HTTP::Get.new(hn_url)
req.add_field('User-Agent', 'Test')
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
response = res.body
if response.nil?
  puts "Bad response when fetching HN json"
  return
end

# Attempt to parse the json
result = JSON.parse(response)
if result.nil?
  puts "Error parsing HN json"
  return
end
Edit 2:
Just found the API's GitHub page. Looks like this is an outstanding issue. Still not sure if there are any workarounds I can apply from my end: https://github.com/glebpopov/Hacker-News-Droid-API/issues/4