UnicodeDecodeError when performing os.walk

Solution 1

This problem stems from two underlying issues. The first is the fact that the Python 2.x default encoding is 'ascii', while the default Linux filesystem encoding is 'utf8'. You can verify these encodings via:

import sys
sys.getdefaultencoding()     # Python's default encoding
sys.getfilesystemencoding()  # filesystem encoding reported by the OS

When the os functions that return directory contents, namely os.walk and os.listdir, are given a unicode path and the directory holds both ascii-only and non-ascii filenames, the ascii-encoded filenames are converted automatically to unicode. The others are not. The result is therefore a list containing a mix of unicode and str objects. It is the str objects that cause problems down the line: since they are not ascii, Python has no way of knowing what encoding to use, so they can't be decoded automatically into unicode.

Therefore, when performing common operations such as os.path.join(dir, file), where dir is unicode and file is an encoded str, the call will fail if the filename is not ascii-encoded (the default). The solution is to check each filename as soon as it is retrieved and decode the str (encoded) objects to unicode using the appropriate encoding.
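The failure can be reproduced without touching the filesystem. What follows is a minimal sketch with made-up names: in Python 2, joining a unicode directory with a non-ascii str triggers an implicit ascii decode, and the explicit decode below raises the same error. It uses only byte and unicode literals, so it should behave the same on Python 2.7 and Python 3.

```python
# Hypothetical names, for illustration only.
directory = u'/data/files'
filename = b'report \x8b draft.txt'   # non-ascii bytes, as os.listdir may return them

try:
    # Python 2 performs this decode implicitly inside os.path.join(directory, filename).
    filename.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```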

That's the first problem and its solution. The second is a bit trickier. Since the files originally came from a Windows system, their filenames probably use an encoding called windows-1252. An easy means of checking is to call:

filename.decode('windows-1252')

If a valid unicode version results, you probably have the correct encoding. You can further verify by calling print on the unicode version and checking that the correct filename is rendered.
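As a quick illustration of that check, here is a sketch using the byte string from the question rather than a real filename. Byte 0x8b maps to U+2039 (a single angle quotation mark) in windows-1252. The check is only suggestive: nearly every byte decodes under windows-1252, so a successful decode does not prove the encoding, which is why looking at the printed name matters. A few bytes, such as 0x81, have no mapping in Python's codec and still raise.

```python
raw = b'a string \x8b with non-ascii'

decoded = raw.decode('windows-1252')
# 0x8b is the single left-pointing angle quotation mark in windows-1252.
assert decoded == u'a string \u2039 with non-ascii'

# A handful of bytes, such as 0x81, are undefined in this codec and still fail.
try:
    b'\x81'.decode('windows-1252')
    undefined_ok = True
except UnicodeDecodeError:
    undefined_ok = False
```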

One last wrinkle. On a Linux system with files of Windows origin, it is possible or even probable that the filenames use a mix of windows-1252 and utf8 encodings. There are two means of dealing with this mixture. The first and preferable one is to run:

$ convmv -f windows-1252 -t utf8 -r DIRECTORY --notest

where DIRECTORY is the one containing the files needing conversion. This command will convert any windows-1252 encoded filenames to utf8. It does a smart conversion, in that if a filename is already utf8 (or ascii), it will do nothing.

The alternative (if one cannot do this conversion for some reason) is to do something similar on the fly in python. To wit:

def decodeName(name):
    if isinstance(name, str):  # leave unicode ones alone (Python 2: str is bytes)
        try:
            name = name.decode('utf8')
        except UnicodeDecodeError:
            # not valid utf8, so fall back to windows-1252
            name = name.decode('windows-1252')
    return name

The function tries a utf8 decoding first. If it fails, then it falls back to the windows-1252 version. Use this function after an os call returning a list of files:

for root, dirs, files in os.walk(path):
    files = [decodeName(f) for f in files]
    # do something with the unicode filenames now
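For reference, the same fallback can be written so that it also runs on Python 3, where names are bytes only if a bytes path was passed to os.walk. This is a sketch, not the author's code: decode_name is a hypothetical name and the sample filenames are made up.

```python
def decode_name(name):
    """Return name as text, trying UTF-8 first and windows-1252 second."""
    if not isinstance(name, bytes):   # already text, leave it alone
        return name
    try:
        return name.decode('utf-8')
    except UnicodeDecodeError:
        return name.decode('windows-1252')

# The same visible name stored as UTF-8, as windows-1252, and as text.
samples = [b'caf\xc3\xa9.txt', b'caf\xe9.txt', u'plain.txt']
decoded = [decode_name(s) for s in samples]
```

Note that the fallback is a guess: a name stored in some third encoding will usually decode without error under windows-1252 and come out as the wrong characters.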

I personally found the entire subject of unicode and encoding very confusing, until I read this wonderful and simple tutorial:

http://farmdev.com/talks/unicode/

I highly recommend it for anyone struggling with unicode issues.

Solution 2

Right I just spent some time sorting through this error, and wordier answers here aren't getting at the underlying issue:

The problem is, if you pass a unicode string into os.walk(), then os.walk starts getting unicode back from os.listdir() and tries to keep it as ASCII (hence 'ascii' decode error). When it hits a unicode only special character which str() can't translate, it throws the exception.

The solution is to force the starting path you pass to os.walk to be a regular string - i.e. os.walk(str(somepath)). This means os.listdir returns regular byte-like strings and everything works the way it should.

You can reproduce this problem (and show that its solution works) trivially like this:

  1. Go into bash in some directory and run touch $(echo -e "\x8b\x8bThis is a bad filename") which will make some test files.

  2. Now run the following Python code (iPython Qt is handy for this) in the same directory:

    import os
    l = []
    for root,dir,filenames in os.walk(unicode('.')):
        l.extend([ os.path.join(root, f) for f in filenames ])
    print l
    

And you'll get a UnicodeDecodeError.

  3. Now try running:

    l = []
    for root,dir,filenames in os.walk('.'):
        l.extend([ os.path.join(root, f) for f in filenames ])
    print l
    

No error and you get a print out!

Thus the safe way in Python 2.x is to make sure you only pass byte strings (plain str) to os.walk(). You absolutely should not pass unicode, or things which might be unicode, to it, because os.walk will then choke when an internal ascii conversion fails.
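The same "bytes in, bytes out" behaviour holds on Python 3, where it has to be requested explicitly because plain string literals are text. The following is a sketch for Python 3 only; the temporary directory and file name exist purely to give the walk something to return.

```python
import os
import shutil
import tempfile

# Create a throwaway directory with one file so the walk has something to return.
top = tempfile.mkdtemp()
open(os.path.join(top, 'example.txt'), 'w').close()

paths = []
for root, dirs, filenames in os.walk(os.fsencode(top)):     # bytes path in
    paths.extend(os.path.join(root, f) for f in filenames)  # bytes paths out

shutil.rmtree(top)   # clean up the throwaway directory
```

Because every name stays a byte string, no decoding happens during the walk and no UnicodeDecodeError can be raised by it.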

Solution 3

I can reproduce the os.listdir() behavior: os.listdir(unicode_name) returns undecodable entries as bytes on Python 2.7:

>>> import os
>>> os.listdir(u'.')
[u'abc', '<--\x8b-->']

Notice: the second name is a bytestring despite listdir()'s argument being a Unicode string.

The question asks: "A big question remains however - how can this be solved without resorting to this hack?"

Python 3 handles bytes in filenames that cannot be decoded with the filesystem's character encoding via the surrogateescape error handler (os.fsencode/os.fsdecode). See PEP-383: Non-decodable Bytes in System Character Interfaces:

>>> os.listdir(u'.')
['abc', '<--\udc8b-->']

Notice: both strings are Unicode (Python 3), and the surrogateescape error handler was used for the second name. To get the original bytes back:

>>> os.fsencode('<--\udc8b-->')
b'<--\x8b-->'
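The round trip can also be seen with the codec machinery directly. This is a Python 3 sketch that names the encoding explicitly, on the assumption that the filesystem encoding is UTF-8, so the result does not depend on the machine's locale; os.fsdecode and os.fsencode apply the same handler with whatever encoding the system reports.

```python
raw = b'<--\x8b-->'

# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode('utf-8', 'surrogateescape')
```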

In Python 2, use Unicode strings for filenames on Windows (Unicode API) and OS X (utf-8 is enforced), and use bytestrings on Linux and other systems.

Solution 4

\x8b is not a valid utf-8 encoded character on its own. os.path expects the filenames to be in utf-8. If you want to access invalid filenames, you have to pass os.walk a non-unicode start path; this way the os module will not do the utf8 decoding. You would have to do it yourself and decide what to do with the filenames that contain incorrect characters.

I.e.:

for root, dirs, files in os.walk(startpath.encode('utf8')):
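One way to "decide what to do" with the byte names that come back is to separate those that decode cleanly from those that do not, instead of letting an exception end the loop. This is a sketch under that assumption; split_by_decodability is a hypothetical helper, not part of the os module, and the sample names are taken from the other answers.

```python
def split_by_decodability(names, encoding='utf-8'):
    """Split byte names into (decoded_text, undecodable_bytes) lists."""
    good, bad = [], []
    for name in names:
        try:
            good.append(name.decode(encoding))
        except UnicodeDecodeError:
            bad.append(name)   # keep the raw bytes so the file can still be opened
    return good, bad

good, bad = split_by_decodability([b'abc', b'<--\x8b-->'])
```

The undecodable names can then be logged, renamed, or stored as raw bytes, while the walk itself carries on.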

Solution 5

After examination of the source of the error, something happens within the C-code routine listdir which returns non-unicode filenames when they are not standard ascii. The only fix therefore is to do a forced decode of the directory list within os.walk, which requires a replacement of os.walk. This replacement function works:

import os

def asciisafewalk(top, topdown=True, onerror=None, followlinks=False):
    """
    duplicate of os.walk, except we do a forced decode after listdir
    """
    islink, join, isdir = os.path.islink, os.path.join, os.path.isdir

    try:
        names = os.listdir(top)
        # force non-ascii text out: bytes that are not valid utf8 are dropped
        names = [name.decode('utf8','ignore') for name in names]
    except os.error as err:
        if onerror is not None:
            onerror(err)
        return

    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        new_path = join(top, name)
        if followlinks or not islink(new_path):
            for x in asciisafewalk(new_path, topdown, onerror, followlinks):
                yield x
    if not topdown:
        yield top, dirs, nondirs

By adding the line: names = [name.decode('utf8','ignore') for name in names] all the names are proper ascii & unicode, and everything works correctly.
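One commenter objects that this approach can produce names that are not present on the filesystem or that duplicate existing names. A small sketch of why, using two made-up byte names: the 'ignore' handler silently drops the undecodable byte, so distinct files can end up with the same decoded name.

```python
names = [b'a.txt', b'a\x8b.txt']   # two distinct files on disk (hypothetical)

decoded = [name.decode('utf8', 'ignore') for name in names]
# Both entries decode to the same text; the second no longer matches any real file.
collides = len(set(decoded)) < len(names)
```

Any later isdir or open call on the second decoded name would therefore refer to the first file, or to nothing at all.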

A big question remains however - how can this be solved without resorting to this hack?

Author: Scott

Updated on September 10, 2020

Comments

  • Scott over 3 years

    I am getting the error:

    'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)
    

    when trying to do os.walk. The error occurs because some of the files in a directory have the 0x8b (non-utf8) character in them. The files come from a Windows system (hence the utf-16 filenames), but I have copied the files over to a Linux system and am using python 2.7 (running in Linux) to traverse the directories.

    I have tried passing a unicode start path to os.walk, and all the files & dirs it generates are unicode names until it comes to a non-utf8 name, and then for some reason, it doesn't convert those names to unicode and then the code chokes on the utf-16 names. Is there any way to solve the problem short of manually finding and changing all the offensive names?

    If there is not a solution in python2.7, can a script be written in python3 to traverse the file tree and fix the bad filenames by converting them to utf-8 (by removing the non-utf8 chars)? N.B. there are many non-utf8 chars in the names besides 0x8b, so it would need to work in a general fashion.

    UPDATE: The fact that 0x8b is still only a byte char (just not valid ascii) makes it even more puzzling. I have verified that there is a problem converting such a string to unicode, but that a unicode version can be created directly. To wit:

    >>> test = 'a string \x8b with non-ascii'
    >>> test
    'a string \x8b with non-ascii'
    >>> unicode(test)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 9: ordinal not in range(128)
    >>> 
    >>> test2 = u'a string \x8b with non-ascii'
    >>> test2
    u'a string \x8b with non-ascii'
    

    Here's a traceback of the error I am getting:

    80.         for root, dirs, files in os.walk(unicode(startpath)):
    File "/usr/lib/python2.7/os.py" in walk
    294.             for x in walk(new_path, topdown, onerror, followlinks):
    File "/usr/lib/python2.7/os.py" in walk
    294.             for x in walk(new_path, topdown, onerror, followlinks):
    File "/usr/lib/python2.7/os.py" in walk
    284.         if isdir(join(top, name)):
    File "/usr/lib/python2.7/posixpath.py" in join
    71.             path += '/' + b
    
    Exception Type: UnicodeDecodeError at /admin/casebuilder/company/883/
    Exception Value: 'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)
    

    The root of the problem occurs in the list of files returned from listdir (on line 276 of os.walk):

    names = listdir(top)
    

    The names with chars > 128 are returned as non-unicode strings.

    • Jayanth Koushik about 10 years
      I guess you could catch the exceptions and handle them separately?
    • user2357112 about 10 years
      Can you show the full traceback?
    • Jon Skeet about 10 years
      What do you mean by "non-UTF8"? Byte 0x8b certainly isn't valid as ASCII, but we'd need to see the following bytes to know whether it was valid as UTF-8. Just because you've seen a byte of 0x8b doesn't mean it's trying to represent U+008B as a character.
    • User about 10 years
      try: os.walk(unicode(path)).
  • Scott about 10 years
    This does not appear to work, and I don't see how it would either - see my eventual solution below.
  • jfs about 10 years
    this method drops directories and files that contain undecodable bytes and it introduces names that are not present on the filesystem or duplicate the existing names (it is not good).
  • Scott about 10 years
    Since I am using python 2.x, is there a way to deal with the problem when I have no control over these encoded filenames? BTW, the files originated on a Windows system, but now reside on a Linux one.
  • jfs about 10 years
    @Scott: yes. If filenames could be arbitrary byte sequences then just use bytes. If you pass bytes to os.walk(); it returns bytes. You could decode them into Unicode later if you think the filenames are not corrupted (it should be possible to guess the character encoding if you have many names).
  • Scott almost 10 years
    J.F. Sebastian: You are absolutely correct. I have finally found the correct solution that does not involve any hacks. I have it marked as the correct one.
  • jfs almost 10 years
    -1. it is still wrong. Default encoding has nothing to do with it. Python doesn't use it to decode filenames. Undecodable filenames (as the name suggests) can't be decoded using any character encoding. See my answer.
  • DoTheEvo almost 9 years
    @ondra your solution works but... >You would have to do it yourself and decide what to do with the filenames that contain incorrect characters. And i am looking for that solution for some time now, how to do that, but exceptions leave the walk loop. I am putting walk os data in to sqlite database.
  • Simon Steinberger about 8 years
    Works yet beautifully for me on Windows. So, thanks! Best solution, even if there are special cases that may not work.
  • Germán Carrillo over 7 years
    Thanks! The only solution that worked for me as well. Had nasty encoding problems using os.walk until I apply this solution. When I wanted to use root (to join it with each file in files) I had to call the function again in this way: os.path.join( self.decodeName( root ), file ).
  • jfs about 5 years
    @SimonSteinberger this solution may lead to mojibake (your data is corrupted silently)