Python: Convert complex dictionary of strings from Unicode to ASCII

Recursion seems like the way to go here, but if you're on Python 2.x you want to be checking for unicode, not str. The str type represents a string of bytes and the unicode type a string of Unicode characters; neither inherits from the other, and it is unicode-type strings that the interpreter displays with a u in front of them.
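
A small sketch of why the type check matters. It is written so that it also runs on Python 3: `bytes` is the byte-string type on both versions (on Python 2 it is just another name for str), and `type(u'')` is my own stand-in for the text type (unicode on Python 2, str on Python 3). The sample string is made up.

```python
value = u'stuff like this'  # a text string, like the ones in the question

# A check against the byte-string type never matches a text string...
is_byte_string = isinstance(value, bytes)

# ...while a check against the text type does.
is_text_string = isinstance(value, type(u''))
```

So a branch guarded by a byte-string check is silently skipped for this kind of input, and the value falls through unconverted.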

There's also a little syntax error in your posted code (the trailing elif: should be an else), and you're not returning the same structure in the case where input is either a dictionary or a list. (In the case of a dictionary, you're returning the converted version of the final key; in the case of a list, you're returning the converted version of the final element. Neither is right!)
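
To make the second point concrete, here is a minimal sketch of a loop-based version in which each branch builds up and returns a container of the same type it was given. The name `convert_with_loops`, the parameter name `data` and the `text_type` alias are my own assumptions for illustration; the alias is only there so that the snippet also runs on Python 3.

```python
# Hypothetical compatibility alias: unicode on Python 2, str on Python 3.
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3


def convert_with_loops(data):
    if isinstance(data, dict):
        ret = {}
        for key in data:
            # Store each converted pair instead of overwriting ret.
            ret[convert_with_loops(key)] = convert_with_loops(data[key])
    elif isinstance(data, list):
        ret = []
        for element in data:
            # Append each converted element instead of overwriting ret.
            ret.append(convert_with_loops(element))
    elif isinstance(data, text_type):
        ret = data.encode('ascii')
    else:
        ret = data
    return ret


# Made-up sample input with nesting in both directions.
result = convert_with_loops({u'key': [u'a', 1, {u'nested': u'b'}]})
```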

You can also make your code pretty and Pythonic by using comprehensions.

Here, then, is what I'd recommend:

def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
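
As a quick usage sketch, the snippet below runs the same logic over a small, made-up JSON document. To keep it runnable on Python 3 as well, it swaps in things that are not in the code above: a hypothetical `text_type` alias in place of unicode, `.items()` in place of `.iteritems()`, and the parameter name `data` so that the `input` builtin is not shadowed.

```python
import json

# Hypothetical compatibility alias: unicode on Python 2, str on Python 3.
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3


def convert(data):
    if isinstance(data, dict):
        # Keys and values are both converted recursively.
        return {convert(key): convert(value) for key, value in data.items()}
    elif isinstance(data, list):
        return [convert(element) for element in data]
    elif isinstance(data, text_type):
        return data.encode('utf-8')
    else:
        # Numbers, booleans, None and so on pass through untouched.
        return data


# Made-up sample payload standing in for a parsed API response.
parsed = json.loads('{"name": "example", "tags": ["a", "b"], "count": 3}')
converted = convert(parsed)
```

Every text string in the nested result, keys included, comes back as a byte string, while the integer is left alone.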

One final thing. I changed encode('ascii') to encode('utf-8'). My reasoning is as follows: any unicode string that contains only characters in the ASCII character set is represented by the same byte string whether it is encoded as ASCII or as UTF-8. Using UTF-8 instead of ASCII therefore cannot break anything, and the change is invisible as long as the unicode strings you're dealing with use only ASCII characters. However, the change also lets the function handle strings containing characters from the entire Unicode character set, rather than just ASCII ones, should that ever be necessary.
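
The following sketch checks that reasoning on two made-up sample strings; the variable names are purely illustrative.

```python
ascii_only = u'stuff like this'
non_ascii = u'caf\xe9'  # "cafe" with an acute accent on the final e

# ASCII-only text: both codecs produce identical bytes.
same_bytes = ascii_only.encode('ascii') == ascii_only.encode('utf-8')

# Non-ASCII text: UTF-8 can still encode it...
utf8_bytes = non_ascii.encode('utf-8')

# ...whereas the ASCII codec raises UnicodeEncodeError.
try:
    non_ascii.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Here same_bytes is True, and the ASCII attempt on the accented string fails where the UTF-8 one succeeds.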

Author: Dreen

Updated on June 16, 2022

Comments

  • Dreen
    Dreen almost 2 years

    Possible Duplicate:
    How to get string Objects instead Unicode ones from JSON in Python?

    I have a lot of input as multi-level dictionaries parsed from JSON API calls. The strings are all in unicode which means there is a lot of u'stuff like this'. I am using jq to play around with the results and need to convert these results to ASCII.

    I know I can write a function to just convert it like that:

    def convert(input):
        if isinstance(input, dict):
            ret = {}
            for stuff in input:
                ret = convert(stuff)
        elif isinstance(input, list):
            ret = []
            for i in range(len(input))
                ret = convert(input[i])
        elif isinstance(input, str):
            ret = input.encode('ascii')
        elif :
            ret = input
        return ret
    

    Is this even correct? Not sure. That's not what I want to ask you though.

    What I'm asking is, this is a typical brute-force solution to the problem. There must be a better way. A more pythonic way. I'm no expert on algorithms, but this one doesn't look particularly fast either.

    So is there a better way? Or if not, can this function be improved...?


    Post-answer edit

    Mark Amery's answer is correct but I would like to post a modified version of it. His function works on Python 2.7+ and I'm on 2.6 so had to convert it:

    def convert(input):
        if isinstance(input, dict):
            return dict((convert(key), convert(value)) for key, value in input.iteritems())
        elif isinstance(input, list):
            return [convert(element) for element in input]
        elif isinstance(input, unicode):
            return input.encode('utf-8')
        else:
            return input
    
  • Joel Cornett
    Joel Cornett over 11 years
    +1. Except for your comment about recursion :) Recursion is useful for almost any kind of tree traversal, and most parsing problems. Recursion is often the "way to go", especially when it comes to functional programming.
  • Mark Amery
    Mark Amery over 11 years
    @JoelCornett Fair enough. My comment wasn't meant to be broadly anti-recursion; I can see that recursion makes sense in tree traversal problems, of which I guess a lot of parsing problems are a subset. I'm just pretty new to this game and not from a compsci background, so I haven't come across any problems of that nature myself yet. Examples of recursion I've seen tend to be pointless and contrived, and apply it to situations where iteration would be clearer. This is the first time I've suddenly gone 'whoa, recursion really simplifies things here', which was exciting for me. :)
  • Dreen
    Dreen over 11 years
    Thanks, this is really nice. Much better than any answer in the question that this is supposedly a duplicate of.
  • Dreen
    Dreen over 11 years
    Also, I posted a modified version of your code for older Python.
  • Gil Zellner
    Gil Zellner about 8 years
    Your code didn't work for me for some reason, so I did this instead:

    def unicode_to_string(text):
        if type(text) is unicode:
            return text.encode('ascii', 'ignore')
        if type(text) is list:
            return [unicode_to_string(a) for a in text]
        if type(text) is dict:
            return dict((unicode_to_string(key), unicode_to_string(value))
                        for key, value in text.iteritems())
        return text
  • nishantvas
    nishantvas almost 7 years
    worked like a charm, thanks
  • FlyingZebra1
    FlyingZebra1 almost 5 years
    thank you. works great - py2.7/ubuntu 19 -> input = json response convert w json module