How can I detect if a file is binary (non-text) in Python?

86,245

Solution 1

You can also use the mimetypes module:

import mimetypes
...
# file is a path string; guess_type() returns a (type, encoding) tuple
mime, encoding = mimetypes.guess_type(file)

It's fairly easy to compile a list of binary MIME types. For example, Apache ships with a mime.types file that you could parse into two lists, binary and text, and then check whether the guessed MIME type is in your text list or your binary list. Note that guess_type() looks only at the file name extension, not at the file's contents.
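A minimal sketch of that idea, assuming that anything under text/ counts as text and using a small hand-picked set of extra text-like types. The names TEXT_LIKE_TYPES and classify_by_mime are hypothetical and only for illustration; they are not part of the mimetypes module.

```python
import mimetypes

# Hypothetical allowlist of non-"text/" MIME types to treat as text.
# In practice this could be built from a mime.types file instead.
TEXT_LIKE_TYPES = {"application/json", "application/xml"}


def classify_by_mime(filename):
    """Return "text", "binary", or None when the type cannot be guessed.

    Only the file name is inspected; the file is never opened.
    """
    mime, encoding = mimetypes.guess_type(filename)
    if mime is None:
        return None  # unknown extension: no verdict either way
    if encoding is not None:
        return "binary"  # e.g. a gzip-compressed file is binary on disk
    if mime.startswith("text/") or mime in TEXT_LIKE_TYPES:
        return "text"
    return "binary"
```

Because the guess comes from the extension alone, a file with no extension or a misleading one gets no answer or the wrong answer, so this works best as a cheap first pass.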

Solution 2

Yet another method based on file(1) behavior:

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

Solution 3

If you're using Python 3 with UTF-8, it is straightforward: open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses str (Unicode) when handling files in text mode (and bytes in binary mode); if your encoding can't decode arbitrary files, it's quite likely that you will get a UnicodeDecodeError.

Example:

try:
    # Pass the encoding explicitly so the check does not depend on the locale
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            process_line(line)
except UnicodeDecodeError:
    pass  # Found non-text data
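If you only need a yes/no answer before processing, one possible variant reads fixed-size chunks instead of lines, so a file without newlines cannot force a huge read. This is a sketch under the assumption that "text" means "decodes cleanly as UTF-8"; looks_like_utf8_text and chunk_size are made-up names.

```python
def looks_like_utf8_text(path, chunk_size=64 * 1024):
    """Return True if the whole file decodes as UTF-8, False otherwise."""
    try:
        # newline="" disables newline translation; only decoding matters here.
        with open(path, "r", encoding="utf-8", newline="") as f:
            # read(n) in text mode returns up to n characters and decodes
            # incrementally, so multi-byte sequences are handled across chunks.
            while f.read(chunk_size):
                pass
    except UnicodeDecodeError:
        return False
    return True
```

Keep in mind that a null byte is valid UTF-8, so a file full of zero bytes passes this check, while UTF-16 text with a byte order mark fails it.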

Solution 4

Try this:

def is_binary(filename):
    """Return True if the given file is binary, i.e. contains a null byte.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <[email protected]>
    @author: Jorge Orpinel <[email protected]>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while True:
            chunk = fin.read(CHUNKSIZE)
            # In binary mode read() returns bytes, so compare against b'\0'
            if b'\0' in chunk:  # found null byte
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done
    # Note: Python does not need an "except:" clause here; try/finally is enough.
    finally:
        fin.close()

    return False
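As a rough illustration of how a null-byte check like this could be used to keep binary files out of search results, here is a sketch that walks a directory and skips any file containing a zero byte. The names has_null_byte and grep_text_files are hypothetical, and decoding with errors="replace" is an assumption made so that odd bytes do not abort the search.

```python
import os


def has_null_byte(path, chunk_size=1024):
    """Return True if the file contains a zero byte anywhere."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False  # reached end of file without seeing b"\0"
            if b"\0" in chunk:
                return True


def grep_text_files(root, needle):
    """Yield (path, line_number, line) for matches in files without null bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if has_null_byte(path):
                    continue  # treated as binary, so skipped
                with open(path, "r", encoding="utf-8", errors="replace") as f:
                    for line_number, line in enumerate(f, start=1):
                        if needle in line:
                            yield path, line_number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it
```

Like the function above, this treats UTF-16 and UTF-32 text as binary, because those encodings contain zero bytes.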

Solution 5

If it helps, many binary file types begin with a magic number. Here is a list of file signatures.
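A small sketch of a magic-number check, using a handful of widely documented signatures. The table is deliberately tiny and the names MAGIC_NUMBERS, identify_magic and identify_file are made up for this example; a real list of signatures would be much longer.

```python
# A few well-known file signatures (magic numbers) and what they indicate.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b": "gzip data",
    b"\x7fELF": "ELF executable",
}


def identify_magic(header):
    """Return a description if header starts with a known signature, else None."""
    for signature, description in MAGIC_NUMBERS.items():
        if header.startswith(signature):
            return description
    return None


def identify_file(path):
    """Read just enough bytes from the file to test every signature."""
    longest = max(len(signature) for signature in MAGIC_NUMBERS)
    with open(path, "rb") as f:
        return identify_magic(f.read(longest))
```

A match is strong evidence that a file is binary, but a result of None only means "no known signature", not "this is text".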

Author: grieve

Something clever

Updated on July 08, 2022

Comments

  • grieve
    grieve almost 2 years

    How can I tell if a file is binary (non-text) in Python?

    I am searching through a large set of files in Python, and keep getting matches in binary files. This makes the output look incredibly messy.

    I know I could use grep -I, but I am doing more with the data than what grep allows for.

    In the past, I would have just searched for characters greater than 0x7f, but UTF-8 and similar encodings make that impossible on modern systems. Ideally, the solution would be fast.

    • Ishbir
      Ishbir about 15 years
      IF "in the past I would have just searched for characters greater than 0x7f" THEN you used to work with plain ASCII text THEN still no issue since ASCII text encoded as UTF-8 remains ASCII (i.e. no bytes > 127).
    • grieve
      grieve about 15 years
      @ΤΖΩΤΖΙΟΥ: True, but I happen to know that some of the files I am dealing with are UTF-8. I meant "used to" in the general sense, not in the specific sense of these files. :)
    • SigTerm
      SigTerm about 14 years
      Only with probability. You can check whether: 1) the file contains \n; 2) the number of bytes between \n's is relatively small (this is NOT reliable); 3) the file doesn't contain bytes with a value less than that of the ASCII "space" character (' '), EXCEPT "\n", "\r", "\t" and zeroes.
    • intuited
      intuited over 13 years
      The strategy that grep itself uses to identify binary files is similar to that posted by Jorge Orpinel below. Unless you set the -z option, it will just scan for a null character ("\000") in the file. With -z, it scans for "\200". Those interested and/or skeptical can check line 1126 of grep.c. Sorry, I couldn't find a webpage with the source code, but of course you can get it from gnu.org or via a distro.
    • intuited
      intuited over 13 years
      P.S. As mentioned in the comments thread for Jorge's post, this strategy will give false positives for files containing, for example, UTF-16 text. Nonetheless, both git diff and GNU diff also use the same strategy. I'm not sure if it's so prevalent because it's so much faster and easier than the alternative, or if it's just because of the relative rarity of UTF-16 files on systems which tend to have these utils installed.
    • guettli
      guettli over 9 years
      Use a library (see my answer below).
    • Hans Ginzel
      Hans Ginzel over 3 years
      Use perl -ne 'print if -B' filename, see stackoverflow.com/questions/29516984/…. See github.com/Perl/perl5/blob/blead/pp_sys.c#L3543 for implementation.
  • David Z
    David Z about 15 years
    For reference, the file command guesses a type based on the file's content. I'm not sure whether it pays any attention to the file extension.
  • fortran
    fortran about 15 years
    I'm almost sure it looks both in the content and the extension.
  • Alan Plum
    Alan Plum over 14 years
    This breaks if the path contains "text", tho. Make sure to rsplit at the last ':' (provided there's no colon in the file type description).
  • dubek
    dubek over 14 years
    Use file with the -b switch; it'll print only the file type without the path.
  • John Machin
    John Machin about 14 years
    -1 defines "binary" as containing a zero byte. Will classify UTF-16-encoded text files as "binary".
  • intuited
    intuited over 13 years
    @John Machin: Interestingly, git diff actually works this way, and sure enough, it detects UTF-16 files as binary.
  • intuited
    intuited over 13 years
    Is there a way to get mimetypes to use the contents of a file rather than just its name?
  • intuited
    intuited over 13 years
    Hunh.. GNU diff also works this way. It has similar issues with UTF-16 files. file does correctly detect the same files as UTF-16 text. I haven't checked out grep's code, but it too detects UTF-16 files as binary.
  • jfs
    jfs almost 13 years
    +1 @John Machin: utf-16 is a character data according to file(1) that is not safe to print without conversion so this method is appropriate in this case.
  • jfs
    jfs almost 13 years
    a slightly nicer version: is_binary_file = lambda filename: "text" in subprocess.check_output(["file", "-b", filename])
  • Bengt
    Bengt almost 12 years
    That is what libmagic is for. It can be accessed in python via python-magic.
  • Bengt
    Bengt almost 12 years
    @intuited No, but libmagic does that. Use it via python-magic.
  • Sam Watkins
    Sam Watkins over 11 years
    -1 - I don't think 'contains a zero byte' is an adequate test for binary vs text, for example I can create a file containing all 0x01 bytes or repeat 0xDEADBEEF, but it is not a text file. The answer based on file(1) is better.
  • Sam Watkins
    Sam Watkins over 11 years
    There is a similar question with some good answers here: stackoverflow.com/questions/1446549/… The answer based on an activestate recipe looks good to me, it allows a small proportion of non-printable characters (but no \0, for some reason).
  • jfs
    jfs about 10 years
    note: for line in file may consume unlimited amount of memory until b'\n' is found
  • jfs
    jfs about 10 years
    to @Community: ".read()" returns a bytestring here that is iterable (it yields individual bytes).
  • Purrell
    Purrell over 9 years
    This isn't a great answer, only because the mimetypes module is not good for all files. I'm looking at a file now which the system file command reports as "UTF-8 Unicode text, with very long lines", but mimetypes.guess_type() returns (None, None). Also, Apache's MIME type list is a whitelist/subset. It is by no means a complete list of MIME types. It cannot be used to classify all files as either text or non-text.
  • Purrell
    Purrell over 9 years
    Unfortunately, "does not begin with a known magic number" is not equivalent to "is a text file".
  • spectras
    spectras almost 9 years
    Can get both false positive and false negatives, but still is a clever approach that works for the large majority of files. +1.
  • Martijn Pieters
    Martijn Pieters almost 9 years
    Interestingly enough, file(1) itself excludes 0x7f from consideration as well, so technically speaking you should be using bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x7f)) + bytearray(range(0x80, 0x100)) instead. See Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file and github.com/file/file/blob/…
  • jfs
    jfs almost 9 years
    @MartijnPieters: thank you. I've updated the answer to exclude 0x7f (DEL) .
  • Martijn Pieters
    Martijn Pieters almost 9 years
    Nice solution using sets. :-)
  • melissa_boiko
    melissa_boiko over 8 years
    This broke my script :( Investigating, I found out that some conffiles are described by file as "Sendmail frozen configuration - version m"—notice the absence of the string "text". Perhaps use file -i?
  • darksky
    darksky almost 8 years
    Why do you exclude 11 or VT? In the table 11 is considered plain ASCII text, and this is the vertical tab.
  • jfs
    jfs almost 8 years
    @darksky: good catch. From the file(1) link: "I exclude vertical tab because it never seems to be used in real text." This behavior has changed between different file(1) versions (perhaps the link should point to an earlier version). The method is just a heuristic; use whatever works best in your case.
  • abg
    abg about 7 years
    TypeError: cannot use a string pattern on a bytes-like object
  • Eric H.
    Eric H. about 7 years
    guess_type is based on the file name extension, not the real content as the Unix command "file" would use.
  • Mark Ransom
    Mark Ransom over 6 years
    Does Python guarantee the file will be immediately closed if you don't use a with statement to read those 1024 bytes?
  • jfs
    jfs over 6 years
    @MarkRansom to make sure a file is closed, use the with-statement or call .close() method explicitly.
  • Mark Ransom
    Mark Ransom over 6 years
    I only bring it up because you don't do either of those things in this answer.
  • jfs
    jfs over 6 years
    @MarkRansom it is just a REPL example. I'm sure files that you want to check are not called /usr/bin/python literally too.
  • UtahJarhead
    UtahJarhead almost 6 years
    \0 (null) auto fails because there should never be a null in a text file. Most text editors see that and that's where the text file is considered to end.
  • Anmol Singh Jaggi
    Anmol Singh Jaggi about 5 years
    This fails for a lot of '.avi' (video) files.
  • jfs
    jfs almost 5 years
    @scott bytes is not str.
  • RobertG
    RobertG almost 5 years
    I can confirm, guess_type is based on the file extension. Also, in the example code, file is actually a string.
  • Terry
    Terry over 4 years
    Why not use with open(filename, 'r', encoding='utf-8') as f directly?
  • Murtuza Z
    Murtuza Z about 4 years
    This method detects a text file as binary if the text file is UTF-16 LE with a BOM.
  • jfs
    jfs about 4 years
    @MurtuzaZ: It is expected for UTF-16, UTF-32 (they contain zero bytes).
  • DevPlayer
    DevPlayer about 3 years
    Aren't AVI video files binary? Or are you saying some AVI files get a return value of False from this is_binary()?