Handling bad UTF-8 from JSON in Ruby


Solution 1

It looks like the JSON response body you are receiving is tagged as US-ASCII rather than UTF-8, because Net::HTTP deliberately doesn't set an encoding on it.

1.9.3p194 :044 > puts res.body.encoding
US-ASCII

In Ruby 1.9.3, you can force the encoding if you know what it's supposed to be. Try this:

response = res.body.force_encoding('UTF-8')

The JSON parser should then handle the UTF-8 the way you want it to.
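
As a minimal sketch of that step, the snippet below relabels a body and parses it. The helper names parse_utf8_json and sample_body are hypothetical, and the sample body is simulated locally (UTF-8 bytes carrying a US-ASCII tag) rather than fetched from the API. Note that force_encoding only changes the encoding tag; it does not convert any bytes.

```ruby
require 'json'

# Hypothetical helper: relabel a response body as UTF-8, then parse it.
# force_encoding changes only the encoding tag; the bytes are untouched.
def parse_utf8_json(body)
  text = body.dup.force_encoding('UTF-8')
  raise ArgumentError, 'body is not valid UTF-8' unless text.valid_encoding?
  JSON.parse(text)
end

# Simulated body (an assumption for illustration): UTF-8 bytes tagged US-ASCII,
# similar to what the irb session above reports for res.body.
def sample_body
  "{\"title\":\"don\u2019t\"}".dup.force_encoding('US-ASCII')
end

result = parse_utf8_json(sample_body)
puts result['title']
puts result['title'].encoding
```

The valid_encoding? check is there because force_encoding will happily put a UTF-8 tag on bytes that are not UTF-8; failing early is usually easier to debug than a parser error later.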


Solution 2

Using force_encoding seems like the best solution. Following up on Kevin Dickerson's answer, here's an explanation of the weirdness.

Net::HTTP is sort of a mess.

On 1.9.3:

  • If the server sends a chunked response, you'd always get ASCII-8BIT. This seems to take precedence over the other scenarios.
  • If you call http.request with a Get object, you'd get US-ASCII. This method does not do compression for you.
  • If you call http.get, compression is enabled.
    • if the server supports compression, you'd get ASCII-8BIT
    • if the server doesn't send a compressed body, you'd get US-ASCII

You'd get US-ASCII because when Net::HTTP creates the buffer string to receive the response, it's created in the interpreter's default source file encoding, which is US-ASCII. (The net/ source files don't have the magic encoding comment at the top, so they use Ruby's default.)
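
The general rule behind this can be checked directly. The sketch below illustrates how string literals pick up the source file's encoding; it is not a copy of Net::HTTP's internals, and the method names are made up for the example.

```ruby
# The encoding Ruby assumes for this source file.
def source_encoding
  __ENCODING__
end

# A plain literal with no escapes takes the encoding of the file it appears in.
def literal_encoding
  ''.encoding
end

# By contrast, String.new with no arguments always yields a binary string.
def empty_buffer_encoding
  String.new.encoding
end

puts source_encoding
puts literal_encoding
puts empty_buffer_encoding
```

Run on 1.9.3 without a magic comment, the first two lines would be expected to report US-ASCII; on 2.0 and later they report UTF-8 unless the source encoding is overridden.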

Decompressed bodies come back as ASCII-8BIT because the get method hardcodes that encoding when it decompresses.

On 2.0, it seems like you always get UTF-8 back, but this is because that's the default source-file encoding. If you change it via the -K option, the response encoding would change accordingly. Try passing n, e, s, or u to -K.
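
Since the label depends on the Ruby version and the code path, one defensive option is to treat US-ASCII and ASCII-8BIT as "unlabeled" and relabel only when the bytes really are valid UTF-8. This is a sketch under that assumption; normalize_body and UNLABELED are hypothetical names, and it presumes the server actually sends UTF-8.

```ruby
# Encodings that, per the cases above, carry no real information about the body.
UNLABELED = [Encoding::US_ASCII, Encoding::ASCII_8BIT].freeze

# Hypothetical helper: relabel an unlabeled body as UTF-8 when its bytes are
# valid UTF-8; otherwise return the body unchanged.
def normalize_body(body)
  return body unless UNLABELED.include?(body.encoding)

  candidate = body.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : body
end

ascii_tagged  = "caf\u00E9".dup.force_encoding(Encoding::US_ASCII)
binary_tagged = "caf\u00E9".dup.force_encoding(Encoding::ASCII_8BIT)

puts normalize_body(ascii_tagged).encoding
puts normalize_body(binary_tagged).encoding
```

Returning the original string when validation fails keeps genuinely binary payloads intact instead of mislabeling them.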

Author by

hodgesmr

Updated on June 26, 2022

Comments

  • hodgesmr
    hodgesmr almost 2 years

    I'm pulling data from remote json at http://hndroidapi.appspot.com/news/format/json/page/?appid=test . The problem I'm running into is that this API appears to be building the JSON without correctly handling UTF-8 encoding (correct me if I'm wrong here). For example, part of the result that gets passed right now is

    {
    "title":"IPad - please don€™t ding while you and I are asleep  ",
    "url":"http://modern-products.tumblr.com/post/25384729998/ipad-please-dont-ding-while-you-and-i-are-asleep",
    "score":"10 points",
    "user":"roee",
    "comments":"18 comments",
    "time":"1 hour ago",
    "item_id":"4128497",
    "description":"10 points by roee 1 hour ago  | 18 comments"
    }
    

    Notice the don€™t. And that isn't the only type of character it is choking on. Is there anything I can do to convert the data into something clean, given that I don't control the API?

    Edit:

    Here is how I'm pulling down the JSON:

    hn_url = "http://hndroidapi.appspot.com/news/format/json/page/?appid=test"
    url = URI.parse(hn_url)

    # Attempt to get the json
    req = Net::HTTP::Get.new(hn_url)
    req.add_field('User-Agent', 'Test')
    res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
    response = res.body
    if response.nil?
      puts "Bad response when fetching HN json"
      return
    end

    # Attempt to parse the json
    result = JSON.parse(response)
    if result.nil?
      puts "Error parsing HN json"
      return
    end
    

    Edit 2:

    Just found the API's GitHub page. Looks like this is an outstanding issue. Still not sure if there are any workarounds I can apply on my end: https://github.com/glebpopov/Hacker-News-Droid-API/issues/4