How do I distinguish between 'binary' and 'text' files?

Solution 1

The spreadsheet software my company makes reads a number of binary file formats as well as text files.

We start by checking the first few bytes for a magic number we recognize. If the magic number does not match any of the binary types we read, we examine up to the first 2K bytes of the file to see whether it appears to be UTF-8, UTF-16, or text encoded in the current code page of the host operating system. If it passes none of these tests, we assume it is not a file we can handle and throw an appropriate exception.
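
The steps above could be sketched roughly as follows in Python. This is only an illustration, not the spreadsheet product's actual code: the MAGIC_NUMBERS table, the sniff and _decodes names, the requirement of a byte order mark for UTF-16, and the rule that a NUL character disqualifies text are all assumptions made for the example.

```python
import codecs
import locale

# Hypothetical magic-number table; a real application would list its own formats.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
    b"%PDF-": "pdf",
}

SNIFF_BYTES = 2048  # "up to the first 2K bytes"


def _decodes(sample: bytes, encoding: str) -> bool:
    """True if sample decodes cleanly and contains no NUL character."""
    try:
        decoder = codecs.getincrementaldecoder(encoding)()
        # final=False tolerates a multi-byte character cut off at the sample end.
        text = decoder.decode(sample, final=False)
    except (UnicodeError, LookupError):
        return False
    return "\x00" not in text  # assumption: NUL never appears in real text


def sniff(data: bytes) -> str:
    """Return a format name, or raise ValueError if nothing matches."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    sample = data[:SNIFF_BYTES]
    # Assumption: UTF-16 is only accepted when a byte order mark is present.
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if _decodes(sample, "utf-16"):
            return "utf-16"
    if _decodes(sample, "utf-8"):
        return "utf-8"
    if _decodes(sample, locale.getpreferredencoding(False)):
        return "codepage"
    raise ValueError("unrecognized file format")
```

Note that the UTF-8 test accepts plain ASCII as well, and that a stricter check might also reject unusual control characters, which this sketch does not do.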

Solution 2

You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.

file README
README: ASCII English text, with very long lines

file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped

Solution 3

You can determine the MIME type of the file with

file --mime FILENAME

The shorthand is file -i on Linux and file -I (capital i) on macOS (see comments).

If the MIME type starts with text/, the file is text; otherwise it is binary. The only exceptions are XML applications. You can match those by looking for +xml at the end of the MIME type.
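
If you want to apply that rule from a program, a minimal Python sketch might look like this. The function names is_text_mime and mime_of are made up for the example, and mime_of assumes a file binary on the PATH that accepts --mime and -b (brief output, without the leading file name).

```python
import subprocess


def is_text_mime(mime_type: str) -> bool:
    """Apply the rule above: text/* is text, and so is anything ending in +xml."""
    # Drop parameters such as "; charset=us-ascii" that file --mime appends.
    essence = mime_type.split(";", 1)[0].strip().lower()
    return essence.startswith("text/") or essence.endswith("+xml")


def mime_of(path: str) -> str:
    """Ask file(1) for the MIME type of path (assumes file is installed)."""
    result = subprocess.run(
        ["file", "--mime", "-b", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Usage would be something like is_text_mime(mime_of("README")). The rule is deliberately simple, so types such as application/json or application/javascript are classified as binary even though their content is readable text.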

Solution 4

  • To list text file names in current dir/subdirs:

    grep -rIl ''
    
  • Binaries:

    grep -rIL ''
    
  • To check for a particular file:

    grep -qI '' FILE
    

    The exit status is then 0 if the file is text and 1 if it is binary. To check:

    echo $?
    

Key option is this:

  -I     Process a binary file as if it did not contain matching data;

Other options:

  -r, --recursive
         Read all files under each directory, recursively;
  -l, --files-with-matches
         Suppress normal output; instead print the name of each input file from which output would normally have been printed.
  -L, --files-without-match
         Suppress normal output; instead print the name of each input file from which no output would normally have been printed.
  -q, --quiet, --silent
         Quiet; do not write anything to standard output.  Exit immediately with zero status if any match is found, even if an error was detected.
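
If shelling out to grep is not an option, the core of this heuristic can be approximated in a few lines of Python. GNU grep's manual describes NUL bytes (and bytes that are invalid in the current locale's encoding) as indicating binary data; the sketch below only imitates the NUL check. The function names and the 32 KiB sample size are assumptions chosen for the example, not grep's actual behavior.

```python
def looks_like_text(data: bytes, sample_size: int = 32768) -> bool:
    """Rough stand-in for grep -qI '': non-empty and free of NUL bytes."""
    sample = data[:sample_size]
    # An empty file has no line for the empty pattern to match, so grep -qI ''
    # exits with status 1 for it; this sketch mirrors that by returning False.
    return len(sample) > 0 and b"\x00" not in sample


def file_looks_like_text(path: str, sample_size: int = 32768) -> bool:
    """Read only the start of the file and apply the same check."""
    with open(path, "rb") as handle:
        return looks_like_text(handle.read(sample_size), sample_size)
```

Because only a prefix is inspected, a file whose first NUL byte appears after the sample would be reported as text.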

Solution 5

Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T, to test for text). Here's a shell one-liner to list text files:

$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'

(Note that the underscores without a preceding dollar sign are correct: _ is the special filehandle that reuses the result of the previous file test. See perldoc -f -X.)
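
For readers who want to implement something similar themselves, here is a loose Python approximation of the heuristic as perlfunc describes it: a NUL byte means binary, valid UTF-8 containing non-ASCII characters means text, and otherwise the file is binary if more than a third of the examined bytes are "strange". It is not a port of Perl's source. The 512-byte block size, the set of allowed control bytes, the order of the checks, and the name perl_style_is_text are assumptions.

```python
import codecs

BLOCK_SIZE = 512  # assumed; perlfunc only says "the first block or so"
ALLOWED_CONTROLS = frozenset(b"\n\r\t\b\f\x1b")  # assumed harmless control bytes


def perl_style_is_text(data: bytes) -> bool:
    """Approximate Perl's -T test on the first block of data."""
    block = data[:BLOCK_SIZE]
    if not block:
        return True  # perlfunc: both -T and -B are true on an empty file
    if b"\x00" in block:
        return False  # a zero byte marks the block as binary
    if not block.isascii():
        try:
            # final=False tolerates a character cut off at the block boundary.
            codecs.getincrementaldecoder("utf-8")().decode(block, final=False)
            return True  # valid UTF-8 that includes non-ASCII characters
        except UnicodeDecodeError:
            pass
    # Count high-bit bytes and unusual control codes as "strange".
    odd = sum(
        1 for byte in block
        if byte >= 0x80 or (byte < 0x20 and byte not in ALLOWED_CONTROLS)
    )
    return odd * 3 <= len(block)  # binary if more than a third are strange
```

A text file in a single-byte encoding with a few accented letters still passes, since the strange bytes stay well under a third of the block.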

Author: erickthered

Updated on July 05, 2022

Comments

  • erickthered
    erickthered over 1 year

    Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).

    In general, you need to know the contents of a file to be able to do anything useful with it, and from that point of view it doesn't really matter whether the encoding is 'binary' or 'text'. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. And yet it is still useful to talk about 'binary' and 'text' files, but to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.

    However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.

    So the question is: how do you tell whether a file is 'text' or 'binary'? To restrict it further, how do you tell on a Linux-like filesystem? I am not aware of any filesystem metadata that indicates the 'type' of a file, so the question becomes: by inspecting the content of a file, how do I tell whether it is 'text' or 'binary'? For simplicity, let's restrict 'text' to mean characters that are printable on the user's console. In particular, how would you implement this? (I thought this was implied on this site and should have specified it, though being pointed at existing code that does this is helpful in general.) I'm not really after which existing programs I can use to do this.

  • MSN
    MSN almost 15 years
    Well if it doesn't follow those rules then it really isn't a text file. Except for mbcs, but that's an entirely different story.
  • Adam Lassek
    Adam Lassek almost 15 years
    +1 If it's a Linux system, file is going to have much better heuristics than anything you'll build yourself.
  • erickthered
    erickthered almost 15 years
    I think that should be "file -I" (upper case). At least according to my tests and man page.
  • erickthered
    erickthered almost 15 years
    Yeah, if file is available, it is going to be the best tool for the job. No question! Also the 'file -I' is a neat trick. I hadn't thought of shelling out for my particular problem, however I don't think I could cop the performance overhead. Thanks!
  • phihag
    phihag almost 15 years
    Just looked it up, lower case is correct in Debian and gentoo Linux. Their file is ftp.astron.com/pub/file/file-5.00.tar.gz (or a different version). -I(upper) is an option in neither one.
  • erickthered
    erickthered almost 15 years
    Huh, weird. The version on OS X (4.17) uses -I (upper) and the one on my Linux boxes (4.24) uses -i (lower). How bizarre! I wonder if it is an OS X-ism, or the authors simply changed the interface between point releases.
  • Breton
    Breton over 12 years
    unless it's utf-16, or utf32. then there's lots.
  • Deduplicator
    Deduplicator over 9 years
    Prepending a BOM to UTF-8 files is not encouraged by the Unicode standard, and it's a pity they don't forbid it outright. Also, those other formats don't necessarily have one.
  • verboze
    verboze about 8 years
    OS X has two variants for this: lowercase -i will print the type without classification (e.g., file, directory); uppercase -I will print the classification, similar to what you would expect on a Linux system. You will want to use uppercase -I for this to work on that platform.
  • Daniel Cassidy
    Daniel Cassidy over 6 years
    -1 because this relies on the text file being encoded in a Unicode encoding and having a Byte Order Mark. In practice UTF-8 text files usually don’t, and UTF-8 is the most common Unicode encoding. The answer should at least explain this limitation.
  • Daniel
    Daniel over 6 years
    I tested it on files generated by dd and by nano. Your method works great. I am also curious why there were downvotes.
  • anishpatel
    anishpatel about 6 years
    file --mime seems to be consistent for both Linux and macOS. The POSIX spec for file has -i as a different option, so macOS uses -I to remain POSIX compliant.
  • GNUSupporter 8964民主女神 地下教會
    GNUSupporter 8964民主女神 地下教會 over 5 years
    Thanks for great answer. It deserves upvotes. Combined with if..then conditionals, for loop and/or find, it can automate stuff and becomes very powerful.
  • Poul Bak
    Poul Bak over 3 years
    On IIS javascript files are served as: application/javascript, so it's not that simple!
  • link89
    link89 over 1 year
    This should be the accepted answer. It works well.