How to determine file type?

Solution 1

Well, the most robust way would be to write a parser for each file type you want to detect and then just try it: if there are no errors, the file is of the type you tried. This is an expensive approach, but because it also checks the rest of the file for semantic soundness, it ensures that you can actually load the file.
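
As a rough illustration of the "just try to parse it" idea, here is a minimal Python sketch that uses the standard library's wave module as the parser. The function name is_parsable_wav is made up for this example. Note that wave only understands uncompressed PCM data, so a False result really means "this particular parser cannot load it", not "this is definitely not a WAV file".

```python
import wave


def is_parsable_wav(path_or_file):
    """Return True if the stdlib wave parser accepts the input as WAV.

    Hypothetical helper: accepts a file name or a binary file object.
    """
    try:
        with wave.open(path_or_file, "rb") as wav:
            # Opening validates the RIFF/WAVE header and the fmt chunk;
            # reading the frames also walks through the audio data.
            wav.readframes(wav.getnframes())
        return True
    except (wave.Error, EOFError):
        # wave.Error: not a WAV file this module understands.
        # EOFError: the file ended before a complete header was read.
        return False
```

The same pattern should carry over to other formats if you have a decoder for them: attempt the load, and treat the decoder's error as "not this type".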

A much less expensive variant is to look for “magic” bytes: signatures at the start of the file or at known offsets. For example, if a file starts with an ID3 tag, you can be reasonably sure it's an MP3 file. If a file starts with RIFF, followed by a four-byte size field and then WAVEfmt, it's a WAV file. However, such detection cannot guarantee that the file really is of that type; it could be just the signature followed by garbage.
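
A minimal sketch of the magic-bytes check in Python, covering only the two signatures mentioned above. The names sniff_audio_type and sniff_file are invented for this example, and the table of signatures is deliberately incomplete: an MP3 file without an ID3v2 tag, for instance, comes back as None here.

```python
def sniff_audio_type(header: bytes):
    """Guess an audio type from the first bytes of a file, or return None."""
    # An ID3v2 tag starts with the ASCII bytes "ID3".
    if header.startswith(b"ID3"):
        return "mp3"
    # RIFF container: "RIFF", a 4-byte chunk size, then the form type "WAVE".
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    return None


def sniff_file(path):
    """Read just enough of the file to run the signature checks."""
    with open(path, "rb") as f:
        return sniff_audio_type(f.read(12))
```

Only 12 bytes are read, so this is cheap, but a match says nothing about whether the rest of the file is valid.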

Solution 2

While you can use the extension to make a reasonable guess as to what the file is, it's not guaranteed to work 100% of the time. If you are targeting Windows, it will work 99.9% of the time, because the extension is how Windows keeps track of which file is which type.
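
For completeness, the extension-based guess might look like this in Python. AUDIO_EXTENSIONS and guess_type_from_extension are hypothetical names, and the table only lists the two formats discussed here.

```python
from pathlib import Path

# Illustrative, deliberately incomplete table of audio extensions.
AUDIO_EXTENSIONS = {".mp3": "mp3", ".wav": "wav"}


def guess_type_from_extension(filename):
    """Return a type name based only on the extension, or None if unknown."""
    # Lower-case the suffix so "SONG.MP3" and "song.mp3" are treated alike.
    return AUDIO_EXTENSIONS.get(Path(filename).suffix.lower())
```

This never opens the file, so renaming a file changes the answer.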

If you are getting your files from non-Windows sources, the only sure way is to open the file and look for a specific string or set of bytes which will unambiguously identify it. For example, you could look for the ID3 tags in an MP3 file:

The ID3v1 tag occupies 128 bytes, beginning with the string TAG.

or

ID3v2 tags are of variable size, and usually occur at the start of the file, to aid streaming media.
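
A small Python sketch of both checks. It assumes the ID3v1 tag sits in the last 128 bytes of the file, which is where that format places it, and that an ID3v2 tag, when present, is at the very start. The function name find_id3_tags is made up for this example.

```python
import os


def find_id3_tags(path):
    """Return the set of ID3 tag versions whose markers appear in the file."""
    found = set()
    with open(path, "rb") as f:
        # ID3v2: the tag header begins with "ID3" at the start of the file.
        if f.read(3) == b"ID3":
            found.add("ID3v2")
        # ID3v1: a 128-byte block at the end of the file, beginning with "TAG".
        size = f.seek(0, os.SEEK_END)
        if size >= 128:
            f.seek(-128, os.SEEK_END)
            if f.read(3) == b"TAG":
                found.add("ID3v1")
    return found
```

An empty set does not prove the file is not an MP3, since the tags are optional; it only means neither marker was found.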

How far you go depends on how robust you want your solution to be, and the approach relies on there being a header or pattern that's always present.

Doing it this way can help guard against malicious content, where someone posts a piece of malware as an MP3 file (say) and hopes that it will simply be run by a program prone to some exploit (a buffer overrun, perhaps).

Author: Sergey

Updated on April 30, 2020

Comments

  • Sergey
    Sergey about 4 years

    I need to know whether my file is an audio file: mp3, wav, etc.
    How can I do this?

    • Cody Gray
      Cody Gray over 13 years
      What determines a file type for you, aside from the extension? Not every file has metadata that specifies its type.
    • Sergey
      Sergey over 13 years
      A file header that determines the file type: audio file, video file, djvu file, etc.
    • Cody Gray
      Cody Gray over 13 years
      The easy answer is just to open the file as a byte stream and read in the first 20 or so bytes, then. But you should be careful, because there's no real standard for how signature bytes are stored in a file's header. You're going to either have to test all common cases or have some stellar documentation available.
  • Sergey
    Sergey over 13 years
    No, it returns the file extension. But I need to know the file type.
  • Sergey
    Sergey over 13 years
    I can change the extension to whatever I want, but I need to know the file type.
  • Mr. Smith
    Mr. Smith over 13 years
    What happens if the lib/codec he's using can play all of those file formats?
  • Cody Gray
    Cody Gray over 13 years
    +1 for pointing out that file extensions are how Windows determines a file's type (and therefore which application it should open in). If this method breaks, chances are the file is "broken" to the user anyway because they can't open it in Windows Explorer. Mac OS X used to handle this differently, but since 10.6, they switched over to the dark side with file extensions as the primary metadata used for associating files with their creators.
  • ChrisF
    ChrisF over 13 years
    @Cody - interesting point about Macs, I thought they used the "unix" method.
  • Cody Gray
    Cody Gray over 13 years
    It's actually quite a bit more complicated to preserve backwards compatibility with the bifurcated method used in pre-OS X of embedding both a type code and a creator code in the resource fork. In 10.4, Apple started using a Uniform Type Identifier, which is something we in Windows world can only dream of. Up until 10.6, however, a file was still opened based on its creator code if it was present, but this behavior has since been dropped, and all documents (even those with legacy creator codes) use the file extension exclusively.