How are file types known if not from file suffix?

Solution 1

The file utility determines the file type in three ways:

First come the filesystem tests: one of the stat family of system calls is invoked on the file. It reports which of the Unix file types the file is: regular file, directory, symbolic link, character device, block device, named pipe or socket. Depending on the result, the magic tests follow.
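The filesystem test can be sketched in a few lines of Python. This is only an illustration of the idea, not how file itself is implemented; the function name `unix_file_type` is made up for this example.

```python
import os
import stat

def unix_file_type(path):
    """Classify a path using the mode bits returned by lstat(2)."""
    # lstat does not follow symbolic links, so a link is reported as a link.
    mode = os.lstat(path).st_mode
    checks = [
        (stat.S_ISREG, "regular file"),
        (stat.S_ISDIR, "directory"),
        (stat.S_ISLNK, "symbolic link"),
        (stat.S_ISCHR, "character device"),
        (stat.S_ISBLK, "block device"),
        (stat.S_ISFIFO, "named pipe"),
        (stat.S_ISSOCK, "socket"),
    ]
    for predicate, name in checks:
        if predicate(mode):
            return name
    return "unknown"
```

Only a result of "regular file" would lead on to the content-based tests described next.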

The magic tests are a bit more complex. File types are guessed using a database of patterns called the magic file. Some file types can be determined by reading a byte or number at a particular offset within the file (binaries, for example). The magic file lists "magic numbers" to look for in the file, together with the text to print when they match. Those "magic numbers" can be 1-4 byte values, strings, dates or even regular expressions. Further tests can reveal additional information. For an executable, that would be whether it is dynamically linked or not, stripped or not, and its architecture. Sometimes multiple tests must pass before the file type can be identified. However many tests are performed, the result is always just a good guess.

Here are the first 8 bytes in a file of some common filetypes which can help us to get a feeling of what these magic numbers can look like:

             Hexadecimal          ASCII
PNG   89 50 4E 47|0D 0A 1A 0A   ‰PNG|....
JPG   FF D8 FF E1|1D 16 45 78   ÿØÿá|..Ex
JPG   FF D8 FF E0|00 10 4A 46   ÿØÿà|..JF
ZIP   50 4B 03 04|0A 00 00 00   PK..|....
PDF   25 50 44 46|2D 31 2E 35   %PDF|-1.5
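As a rough illustration of how such signatures might be matched, here is a minimal Python sketch. The `SIGNATURES` table and the `sniff` function are invented for this example and cover only the rows of the table above; the real magic database is far larger and also tests at offsets other than zero.

```python
# (leading bytes, label) pairs taken from the table above.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"%PDF-", "PDF document"),
]

def sniff(head):
    """Return a label if the leading bytes match a known signature, else None."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None

# Example: the first 8 bytes from the PNG row of the table.
print(sniff(bytes.fromhex("89504E470D0A1A0A")))
```

Note that the PNG signature starts with the byte 0x89, not with a per cent sign, so a text file beginning with "%PNG" would not match this entry.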

If the magic tests don't identify the file type, file checks whether the file looks like text and tries to determine the encoding of its contents. Encodings are distinguished by the different ranges and sequences of bytes that constitute printable text in each character set.
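A much simplified version of that idea, assuming only three outcomes (ASCII, UTF-8, or some 8-bit encoding), could look like the sketch below. The function name, the set of tolerated control bytes and the "ISO-8859" label are assumptions made for the example; file distinguishes more encodings and uses more careful rules.

```python
# Control bytes that commonly appear in text: tab, LF, CR, FF, BS, ESC.
TEXT_CONTROLS = frozenset(b"\t\n\r\f\b\x1b")

def guess_text_encoding(data):
    """Return a rough encoding label, or None if data does not look like text."""
    if any(b < 0x20 and b not in TEXT_CONTROLS for b in data):
        return None  # unexpected control bytes: treat as binary "data"
    if all(b < 0x80 for b in data):
        return "ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "ISO-8859"  # assumption: any other 8-bit text gets this label
```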

The line terminators are examined as well, based on their hex values:

  • 0A (\n) indicates a file from Un*x/Linux/BSD/OSX
  • 0D 0A (\r\n) indicates a file from Microsoft operating systems
  • 0D (\r) would be Mac OS up to version 9
  • 15 (\025) is the EBCDIC newline (NEL), as used on IBM mainframe systems
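Counting the ASCII terminators from the list above is straightforward. The following Python sketch uses a made-up function name and ignores the EBCDIC case; it reports which conventions occur in a block of bytes.

```python
def line_terminators(data):
    """Return the line terminator styles present in data, e.g. ["CRLF", "LF"]."""
    crlf = data.count(b"\r\n")
    # Subtract the CR and LF bytes that belong to CRLF pairs.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    found = []
    if crlf:
        found.append("CRLF")
    if cr:
        found.append("CR")
    if lf:
        found.append("LF")
    return found
```

A file that mixes styles yields more than one entry, which is itself a useful hint that it has been edited on different systems.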

Finally, the language tests start. If the file appears to be a text file, it is searched for particular strings to find out which language it contains (C, Perl, Bash). Some scripting languages can also be identified by the hashbang (#!/bin/interpreter) in the first line of the script.
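The hashbang check is easy to approximate. In this Python sketch the function name is hypothetical, and the special handling of /usr/bin/env is a convenience added for the example (options such as env -S are not handled).

```python
def interpreter_from_shebang(data):
    """Return the interpreter name from a leading #! line, or None."""
    if not data.startswith(b"#!"):
        return None
    first_line = data.split(b"\n", 1)[0]
    parts = first_line[2:].split()
    if not parts:
        return None
    # "#!/usr/bin/env python3" names the real interpreter in the second word.
    if parts[0].endswith(b"/env") and len(parts) > 1:
        return parts[1].decode("ascii", "replace")
    # Otherwise keep only the last path component, e.g. /bin/bash -> bash.
    return parts[0].rsplit(b"/", 1)[-1].decode("ascii", "replace")
```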

If nothing applies to the file, the file type can't be determined and file just prints "data".

So, as you can see, there is no need for a suffix. A suffix can even be misleading if it is set incorrectly.

Solution 2

Often, the system doesn't care. You just pass the file to a program and either the program interprets it or it doesn't. It may not be useful to open a .jpg in a text editor, but you're not prevented from doing so. The extension, like the rest of the filename, is for the organisational convenience of humans.

It may also be possible to construct files that can be validly interpreted in multiple ways. Because the ZIP file format has its header (the central directory) at the end of the file, you can prepend other things to the front and it will still load as a ZIP file. This is commonly used to make self-extracting zip files.
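The ZIP behaviour can be tried out with Python's standard zipfile module. This is a small experiment rather than a real self-extractor: the shell stub is a placeholder, and `make_zip` is a helper written for this example.

```python
import io
import zipfile

def make_zip(files):
    """Build a ZIP archive in memory from a {name: text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()

archive = make_zip({"hello.txt": "hi"})
# Placeholder stub: a real self-extractor would put an unpacker program here.
stub = b"#!/bin/sh\necho 'not really an extractor'\nexit 0\n"
combined = stub + archive

# The reader locates the directory from the end, so the prefix is tolerated.
with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    names = zf.namelist()
    payload = zf.read("hello.txt")
```

The combined bytes start with a hashbang line, yet the archive members remain readable, which is exactly the ambiguity described above.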

Solution 3

That information is commonly found in the header of the file. The file command analyzes the target and reports what it finds. Much of this is derived from file headers, which are often the first few bytes of a file (see below). The system uses such headers to figure out how to handle files: #!/bin/bash at the beginning of a file tells the system to use the bash shell to interpret the following script, and ELF tells the system that this is an ELF executable.

[~] root@www # file /bin/ls
/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped

[~] root@www # file /etc/passwd
/etc/passwd: ASCII text

File header examples:

[root@server4 ~]# xxd old_sm_logo.png | head -5
0000000: 8950 4e47 0d0a 1a0a 0000 000d 4948 4452  .PNG........IHDR
0000010: 0000 0134 0000 006f 0806 0000 0062 bf3c  ...4...o.....b.<

[root@server4 ~]# xxd /bin/ls | head -5
0000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
0000010: 0200 3e00 0100 0000 a024 4000 0000 0000  ..>......$@.....

[root@server4 proj]# xxd resizer.sh | head -5
0000000: 2321 2f62 696e 2f62 6173 680a 5b20 2d7a  #!/bin/bash.[ -z
0000010: 2022 2431 2220 5d20 2626 2065 6368 6f20   "$1" ] && echo
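To show how a few header bytes turn into the words that file prints, here is a Python sketch that decodes the first eight bytes of the /bin/ls dump above. The function and table names are invented for the example, and the tables list only a handful of values; file reads much more of the ELF header to report the architecture, linking and stripping.

```python
# Partial value tables for the ELF identification bytes.
ELF_CLASS = {1: "32-bit", 2: "64-bit"}
ELF_DATA = {1: "LSB", 2: "MSB"}
ELF_OSABI = {0: "SYSV", 3: "GNU/Linux"}

def describe_elf_ident(head):
    """Describe the ELF identification bytes, or return None if not ELF."""
    if len(head) < 8 or head[:4] != b"\x7fELF":
        return None
    ei_class, ei_data, ei_version, ei_osabi = head[4], head[5], head[6], head[7]
    return "ELF {} {}, version {} ({})".format(
        ELF_CLASS.get(ei_class, "unknown class"),
        ELF_DATA.get(ei_data, "unknown byte order"),
        ei_version,
        ELF_OSABI.get(ei_osabi, "unknown ABI"),
    )

# First 16 bytes of /bin/ls as shown in the xxd output above.
head = bytes.fromhex("7f454c46020101000000000000000000")
print(describe_elf_ident(head))  # ELF 64-bit LSB, version 1 (SYSV)
```

The result lines up with the "ELF 64-bit LSB executable ... version 1 (SYSV)" part of the file output shown earlier.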

Solution 4

The first thing to check is the hard-coded file type that is recognized by the kernel. These are file types such as directory, character-special file, block-special file, pipe-special file, socket, and symbolic link. This information comes from the inode of the file. If the file is a plain file, the next set of information comes from looking for patterns in the first 256 bytes. This is how text files and C source code are recognized. In addition, the utilities also look for a magic number that is used to test and validate the file type. You can have your own file types recognized by adding the information to the file /etc/magic. Refer to the man page for magic(5) to see the format of the magic file.

In older implementations (Solaris, for example), the file /etc/magic enumerated most of the file types recognized.

Solution 5

The file command applies heuristics: it inspects (parts of) the file and makes a qualified guess. Beyond that, there are special cases where additional information can be obtained, such as the #! at the beginning of a text file, a BOM (byte order mark), or specific header bytes of executable file formats. The system uses the #! and the binary markers in executables to tell them apart.
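A BOM check is one of the simpler special cases. The sketch below uses the BOM constants from Python's codecs module; the `BOMS` table and the function name are made up for the example.

```python
import codecs

# UTF-32 entries come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]

def detect_bom(data):
    """Return the encoding implied by a leading byte order mark, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Text without a BOM returns None here, which is the common case for UTF-8 files on Unix, so a BOM can only ever be one hint among several.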

Author by

marked-down

Updated on September 18, 2022

Comments

  • marked-down
    marked-down almost 2 years

    At some point, I'm not sure when, the following shorthand became available to reference an element in an array returned by a method:

    echo $someObject->SomeMethod()['SomeElement'];
    

    Where you append the element name after the method parentheses but before the semi-colon. What PHP version was this made available in?

    • orion
      orion about 9 years
      Just a comment, the rest of the answers cover everything. Nowadays it may happen that with a misconfigured locale or old executables, some utf-8 files may be misdetected as binary data due to non-ascii bytes.
    • jwodder
      jwodder about 9 years
      The system doesn't care. Certain applications may care, but they each have their own ways of handling this.
    • Admin
      Admin about 9 years
      Note that even for regular files (not device files, unix domain sockets, named pipes, etc) "file type" can mean two different things: (1) A particular file format (".docx", XML, MS-DOS text format, RTF, fixed-length records, the list could be very long) or (2) A file that a particular app knows how to deal with (".xlsx" or ".doc" or whatever, there's overlap with the format type). It's worth keeping that distinction in mind when talking about "file type".
    • Mr Lister
      Mr Lister about 9 years
      @jwodder The system does care. It's the system that complains you can't execute a non-executable file when you try to, not those applications!
    • user2338816
      user2338816 about 9 years
      @MrLister True, but executable/non-executable has nothing to do with 'extension'.
  • marked-down
    marked-down over 10 years
    Ahh, is that the name for it. Thank you.
  • lcd047
    lcd047 about 9 years
    There's also the freedesktop.org shared MIME database, which is used by essentially all X11 applications. This is similar in concept to what file(1) does, but with a (very) different implementation.
  • Nate Eldredge
    Nate Eldredge about 9 years
    This is rather misleading. Unix files don't have a "header" per se. The file command tries to guess from the contents of the file how the file is probably intended to be used. It is not infallible.
  • Nate Eldredge
    Nate Eldredge about 9 years
    But your answer makes it sound like a header is an inherent feature of a Unix file. Text files, for instance, have no such header; someone like the OP would probably consider a C source file and a Java source file to have different "file types", but there is no header to distinguish them. I would argue that "file type" is not even a meaningful concept under Unix; the operating system just provides a file system, and it is up to each application to decide what the contents of any given file mean.
  • h3rrmiller
    h3rrmiller about 9 years
    I agree. I was trying to answer as simply as possible without going down too many rabbit holes.
  • user253751
    user253751 about 9 years
    Note that the result of this process is basically a guess, and shouldn't be relied upon for anything important. (Convenience features, like deciding the default program to open the file with, are fine)
  • Hagen von Eitzen
    Hagen von Eitzen about 9 years
    Re the last paragraph: Funky File Formats is an interesting talk on that subject, presenting e.g. a jpeg that is also a java hello world program, after AES encrypting it it becomes a PNG, or after 3DES decrypting it it becomes a PDF and more (all with "interesting" content, i.e. not just with white noise or artefacts)
  • Mark
    Mark about 9 years
    As file formats go, XPM isn't that tricky. I consider "tricky" to start with something that's both a valid JPEG and a valid ZIP file.
  • saga
    saga over 7 years
    So if I add %PNG at the top of a text file, it will be seen as a png file. Right??
  • Bananguin
    Bananguin about 5 years
    @saga If you get the encoding right and if you put a per mille sign instead of a per cent sign then: maybe. There may be additional tests.
  • Vorac
    Vorac almost 4 years
    And this can be useful in steganography. Rename a picture to .zip and it will extract whatever you add to it, but not the picture.