How do I check if a string is unicode or ascii?


Solution 1

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which with code like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
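If the real question is whether a text string contains only ASCII characters, one possible check in Python 3 is to try encoding it. This is a minimal sketch: the helper name is_ascii_text is made up for illustration, and on Python 3.7+ the built-in str.isascii() does the same job.

```python
def is_ascii_text(s):
    """Return True if every character of the str is in the ASCII range."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii_text("abc"))      # True
print(is_ascii_text("\u00dcber"))  # False, U+00DC is outside ASCII
```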

Solution 2

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.


How to tell if a byte string is valid utf-8 or ascii

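You can call decode. If it raises a UnicodeDecodeError exception, the bytes were not valid in that encoding. (The session below is from Python 2; Python 3 displays the decoded result as 'Ü' rather than u'\xdc'.)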

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
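The decode test above can be wrapped in a small helper. The following is a sketch for Python 3, with classify_bytes as a hypothetical name. Note that a successful decode only shows the bytes are valid in that encoding, not that the author intended it.

```python
def classify_bytes(data):
    """Return the first of 'ascii' or 'utf-8' that decodes data, else 'unknown'."""
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return encoding
    return "unknown"


print(classify_bytes(b"abc"))       # ascii
print(classify_bytes(b"\xc3\x9c"))  # utf-8
print(classify_bytes(b"\xff\xfe"))  # unknown
```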

Solution 3

In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which is a Unicode string by default) should suffice.

isinstance(x, str)

In Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check for a 'string-like' object in a single statement, you can do the following:

isinstance(x, basestring)
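Python 3 has no basestring. If a similar "anything string-like" test is wanted there, one possible stand-in is a tuple of types. This is only a sketch: is_string_like is a hypothetical name, and whether bytes should count as string-like depends on the caller.

```python
def is_string_like(x):
    # Treats both text (str) and raw bytes as "string-like" on Python 3.
    return isinstance(x, (str, bytes))


print(is_string_like("abc"))   # True
print(is_string_like(b"abc"))  # True
print(is_string_like(123))     # False
```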

Solution 4

Unicode is not an encoding - to quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

...then Unicode is "text-ness";

it is the abstract form of text

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.
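To make the "abstract form of text" idea concrete, here is a small Python 3 sketch (the variable names are illustrative only). The same one-character text becomes different byte strings under different encodings, and each decodes back to the same text.

```python
text = "\u00dc"  # the single character 'Ü'

# Encode the same abstract text with three different encodings.
encodings = {name: text.encode(name) for name in ("utf-8", "latin-1", "utf-16-le")}

for name, data in encodings.items():
    # Different bytes each time, but decoding recovers the identical text.
    print(name, data, data.decode(name) == text)
```

On Python 3 this prints b'\xc3\x9c', b'\xdc' and b'\xdc\x00' for the three encodings, each followed by True.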

Solution 5

If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s,bytes) or isinstance(s,unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3, and bytes is undefined in Python 2 before 2.6 (from 2.6 on it is merely an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here's an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3, 0, 0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Those are both unpythonic, and most of the time there's probably a better way.
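One possible cleaner approach is to do the version test once, bind type aliases, and use those everywhere else. The sketch below assumes nothing beyond the standard library. The names text_type and binary_type mirror the ones used by the six library, and to_native_str is a hypothetical helper, not part of any package.

```python
import sys

if sys.version_info[0] >= 3:
    text_type = str
    binary_type = bytes
else:
    text_type = unicode  # noqa: F821, only evaluated on Python 2
    binary_type = str


def to_native_str(s, encoding="ascii"):
    """Convert text or bytes to the interpreter's native str type."""
    if isinstance(s, str):
        return s                   # already native on either version
    if isinstance(s, binary_type):
        return s.decode(encoding)  # Python 3 bytes -> str
    if isinstance(s, text_type):
        return s.encode(encoding)  # Python 2 unicode -> str
    raise TypeError("expected a string, got %r" % type(s))


print(to_native_str(b"abc"))  # abc
print(to_native_str("abc"))   # abc
```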


Author: TIMEX

Updated on April 08, 2021

Comments

  • TIMEX
    TIMEX about 3 years

    What do I have to do in Python to figure out which encoding a string has?

    • ulidtko
      ulidtko about 13 years
      Unicode is not an encoding.
    • alex
      alex over 5 years
      @Johnsyweb Because of {UnicodeDecodeError} 'ascii' codec can't decode byte 0xc2
  • Ian Clelland
    Ian Clelland over 11 years
    The better way is probably to use six, and test against six.binary_type and six.text_type
  • user222758
    user222758 over 10 years
    You can use type(s).__name__ to probe type names.
  • PythonNut
    PythonNut about 10 years
    This is false. In Python 2.7 isinstance(u"x",basestring) returns True.
  • KQ.
    KQ. about 10 years
    @PythonNut: I believe that was the point. The use of isinstance(x, basestring) suffices to replace the distinct dual tests above.
  • PythonNut
    PythonNut about 10 years
    No, but isinstance(x, basestring) is True for both unicode and regular strings, making the test useless.
  • oligofren
    oligofren almost 10 years
    I am not quite sure of the use case for that bit of code, unless there is a logic error. I think there should be a "not" in the python 2 code. Otherwise you are converting everything to unicode strings for Python 3 and the opposite for Python 2!
  • mhsmith
    mhsmith about 9 years
    It's useful in many cases, but evidently not what the questioner meant.
  • fiatjaf
    fiatjaf about 9 years
    This is the answer to the question. All others misunderstood what OP said and gave generic answers about type checking in Python.
  • Shadow
    Shadow over 7 years
    Just for other people's reference - str.decode doesn't exist in Python 3. Looks like you have to unicode(s, "ascii") or something
  • Dave Burton
    Dave Burton over 7 years
    Yes, oligofren, that's what it does. The standard internal strings are Unicode in Python 3 and ASCII in Python 2. So the code snippets convert text to standard internal string type (be it Unicode or ASCII).
  • karantan
    karantan over 7 years
    it should be if isinstance(obj, six.text_type) . But yes this is imo the correct answer.
  • Shadow
    Shadow over 7 years
    Sorry, I meant str(s, "ascii")
  • MD004
    MD004 about 6 years
    Doesn't answer OP's question. The title of the question (alone) COULD be interpreted such that this answer is correct. However, OP specifically says "figure out which" in the question's description, and this answer does not address that.
  • Jonny
    Jonny over 5 years
    Those slides are probably the best introduction to Unicode I've come across to date
  • Greg Hewgill
    Greg Hewgill about 5 years
    @ProsperousHeart: You're probably using Python 3.
  • ProsperousHeart
    ProsperousHeart about 5 years
    This is not accurate for python 3
  • winbina
    winbina about 5 years
    @ProsperousHeart Updated to cover Python 3. And to try to explain the difference between bytestrings and unicode strings.
  • Mansour.M
    Mansour.M over 4 years
    You my friend deserve to be the correct response! I am using python 3 and I was still having problems until I found this treasure!
  • RicarHincapie
    RicarHincapie over 3 years
    decode() method's default is 'utf-8'. So, if you call this method over a class 'bytes', you would get a 'OK' with print("utf8 content:", html.decode()), for example.
  • Edward Falk
    Edward Falk over 2 years
    Note: first, you need to confirm you're running Python2. If your code is designed to run under either Python2 or Python3, you'll need to check your Python version first.