Removing non-ASCII characters from any given string type in Python

Solution 1

It's simple: in Python 2, .encode converts unicode objects into byte strings (str), and .decode converts byte strings into unicode objects.
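
As a minimal sketch of the two directions, using the same 'a' plus 0xF5 byte string as the question (the variable names are illustrative only, and the literals are written so the snippet should behave the same on Python 2.7 and Python 3):

```python
# Byte string: 'a' followed by the non-ASCII byte 0xF5.
raw = b'a\xf5'

# decode: byte string -> unicode text. 'ignore' drops bytes that are not ASCII.
text = raw.decode('ascii', 'ignore')

# encode: unicode text -> byte string. Only ASCII is left, so this cannot fail.
back = text.encode('ascii')
```

After this runs, text is a unicode object and back is a byte string again, which matches the u'a' and 'a' results in the question.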

Solution 2

Why did the decode("ascii") give out a unicode string?

Because that's what decode is for: it decodes a byte string like yours into unicode, here using the ASCII codec.

In your second example, you're trying to "decode" a string which is already unicode. Python 2 can only decode byte strings, so it first implicitly encodes the unicode object with the default encoding, which is ASCII. Because you haven't done that step explicitly, the 'ignore' parameter you passed applies only to the decode and not to the implicit encode, so it raises the error that it can't encode the non-ASCII characters. That is why the traceback reports a UnicodeEncodeError even though you called decode.
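
The UnicodeEncodeError from the question can be reproduced without calling decode at all. The sketch below (illustrative names, intended to run on Python 2.7 and Python 3) encodes the unicode value to ASCII once with the default strict error handling and once with 'ignore':

```python
# Unicode text: 'a' followed by U+00F5 (the character from the question).
text = u'a\xf5'

try:
    # Strict is the default error handler, so the non-ASCII character fails.
    text.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# With 'ignore', the character that cannot be encoded is silently dropped.
cleaned = text.encode('ascii', 'ignore')
```

Here error_message holds the familiar "ordinal not in range(128)" text, while cleaned contains only the 'a'.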

The trick to all of this is remembering that decode takes an encoded bytestring and converts it to Unicode, and encode does the reverse. It might be easier if you understand that Unicode is not an encoding; ASCII and UTF-8 are encodings that turn Unicode text into bytes.
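
Putting the two directions together, one possible way to strip non-ASCII characters from either string type is sketched below. The helper name strip_non_ascii is hypothetical and not part of any library, and the code assumes the input is either a byte string or a unicode string:

```python
def strip_non_ascii(value):
    """Return value with non-ASCII characters removed, keeping its type."""
    if isinstance(value, bytes):
        # Byte string: decode to text, dropping non-ASCII bytes, then re-encode.
        return value.decode('ascii', 'ignore').encode('ascii')
    # Unicode text: encode to ASCII, dropping unencodable characters,
    # then decode back so the caller gets text again.
    return value.encode('ascii', 'ignore').decode('ascii')


from_bytes = strip_non_ascii(b'a\xf5')
from_text = strip_non_ascii(u'a\xf5')
```

On Python 2, bytes is an alias for str, so the first branch handles ordinary str values and the second handles unicode objects.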

Author: fullmooninu

Updated on June 09, 2022

Comments

  • fullmooninu
    fullmooninu almost 2 years
    >>> teststring = 'aõ'
    >>> type(teststring)
    <type 'str'>
    >>> teststring
    'a\xf5'
    >>> print teststring
    aõ
    >>> teststring.decode("ascii", "ignore")
    u'a'
    >>> teststring.decode("ascii", "ignore").encode("ascii")
    'a'
    

    which is what I really wanted it to store internally as I remove non-ASCII characters. Why did the decode("ascii") give out a unicode string?

    >>> teststringUni = u'aõ'
    >>> type(teststringUni)
    <type 'unicode'>
    >>> print teststringUni
    aõ
    >>> teststringUni.decode("ascii" , "ignore")
    
    Traceback (most recent call last):
      File "<pyshell#79>", line 1, in <module>
        teststringUni.decode("ascii" , "ignore")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
    >>> teststringUni.decode("utf-8" , "ignore")
    
    Traceback (most recent call last):
      File "<pyshell#81>", line 1, in <module>
        teststringUni.decode("utf-8" , "ignore")
      File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
    >>> teststringUni.encode("ascii" , "ignore")
    'a'
    

    Which is again what I wanted. I don't understand this behavior. Can someone explain to me what is happening here?

    Edit: I thought this would help me understand things so I could solve my real program problem, which I state here: Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)