How can I determine if a file is a PDF file?

Solution 1

You can find out the MIME type of a file (or byte array), so you don't have to rely blindly on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/); a few days ago I also saw a library dedicated to exactly that (http://sourceforge.net/projects/mime-util).

I use Aperture to extract text from a variety of files, not only PDFs, but I have to tweak things for PDFs. For example, Aperture uses PDFBox, but I added another library as a fallback for when PDFBox fails.
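If pulling in a library is not an option, content-based detection for this one format can be approximated by looking at the leading bytes. The following is a minimal sketch using only the JDK; `PdfMagic` and its methods are hypothetical names, not part of Aperture or mime-util, and I am not claiming this is how those libraries work internally. It only reads the first five bytes, so it stays fast on large files, but it says nothing about whether the rest of the file is intact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: sniffs the PDF signature instead of trusting the extension. */
final class PdfMagic {
    // A PDF header begins with the five ASCII bytes "%PDF-".
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the stream starts with "%PDF-"; reads at most five bytes. */
    static boolean startsWithPdfHeader(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int off = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (off < head.length) {
            int n = in.read(head, off, head.length - off);
            if (n < 0) {
                return false; // stream ended before five bytes were available
            }
            off += n;
        }
        return Arrays.equals(head, MAGIC);
    }

    /** Convenience wrapper for files on disk. */
    static boolean startsWithPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithPdfHeader(in);
        }
    }
}
```

This sketch assumes the signature sits at offset 0. A file with junk before the header would be rejected here, even if some viewers still manage to open it.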

Solution 2

Here is what I use in my NUnit tests, which must validate multiple versions of PDF generated by Crystal Reports:

public static void CheckIsPDF(byte[] data)
    {
        Assert.IsNotNull(data);
        Assert.Greater(data.Length,7); // bytes 5..7 (the version) are read below

        // header 
        Assert.AreEqual(data[0],0x25); // %
        Assert.AreEqual(data[1],0x50); // P
        Assert.AreEqual(data[2],0x44); // D
        Assert.AreEqual(data[3],0x46); // F
        Assert.AreEqual(data[4],0x2D); // -

        if(data[5]==0x31 && data[6]==0x2E && data[7]==0x33) // version is 1.3 ?
        {                  
            // file terminator
            Assert.AreEqual(data[data.Length-7],0x25); // %
            Assert.AreEqual(data[data.Length-6],0x25); // %
            Assert.AreEqual(data[data.Length-5],0x45); // E
            Assert.AreEqual(data[data.Length-4],0x4F); // O
            Assert.AreEqual(data[data.Length-3],0x46); // F
            Assert.AreEqual(data[data.Length-2],0x20); // SPACE
            Assert.AreEqual(data[data.Length-1],0x0A); // EOL
            return;
        }

        if(data[5]==0x31 && data[6]==0x2E && data[7]==0x34) // version is 1.4 ?
        {
            // file terminator
            Assert.AreEqual(data[data.Length-6],0x25); // %
            Assert.AreEqual(data[data.Length-5],0x25); // %
            Assert.AreEqual(data[data.Length-4],0x45); // E
            Assert.AreEqual(data[data.Length-3],0x4F); // O
            Assert.AreEqual(data[data.Length-2],0x46); // F
            Assert.AreEqual(data[data.Length-1],0x0A); // EOL
            return;
        }

        Assert.Fail("Unsupported file format");
    }

Solution 3

Since you use PDFBox you can simply do:

PDDocument.load(file);

It'll fail with an Exception if the PDF is corrupted etc.

If it succeeds you can also check if the PDF is encrypted using .isEncrypted()

Solution 4

Here is an adapted Java version of NinjaCross's code.

/**
 * Test if the data in the given byte array represents a PDF file.
 */
public static boolean is_pdf(byte[] data) {
    if (data != null && data.length > 7 && // bytes 5..7 (the version) are read below
            data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -

        // version 1.3 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                data[data.length - 7] == 0x25 && // %
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x45 && // E
                data[data.length - 4] == 0x4F && // O
                data[data.length - 3] == 0x46 && // F
                data[data.length - 2] == 0x20 && // SPACE
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }

        // version 1.4 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x25 && // %
                data[data.length - 4] == 0x45 && // E
                data[data.length - 3] == 0x4F && // O
                data[data.length - 2] == 0x46 && // F
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }
    }
    return false;
}

And some simple unit tests:

@Test
public void test_valid_pdf_1_3_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.3 CONTENT %%EOF \n".getBytes()));
}

@Test
public void test_valid_pdf_1_4_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.4 CONTENT %%EOF\n".getBytes()));
}

@Test
public void test_invalid_data_is_not_pdf() {
    assertFalse(is_pdf("Hello World".getBytes()));
}

If you come up with any failing unit tests, please let me know.
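The checks above only accept versions 1.3 and 1.4 with one exact terminator layout each, so a 1.5, 1.6, 1.7 or 2.0 file, or a 1.3 file without the trailing space, is reported as not being a PDF. Below is a minimal, version-agnostic sketch of the same idea. `PdfSignature` and `looksLikePdf` are hypothetical names of mine, not from any library. It assumes the whole file is already in memory, and it deliberately tolerates trailing whitespace after `%%EOF`, including the space that a strict reading of the format would not allow.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: header/trailer signature check that ignores the PDF version. */
final class PdfSignature {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if data starts with "%PDF-" and ends with "%%EOF" plus optional whitespace. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        if (!matchesAt(data, 0, HEADER)) {
            return false;
        }
        // Step back over trailing whitespace (space, tab, CR, LF) after the marker.
        int end = data.length;
        while (end > 0 && isWhitespace(data[end - 1])) {
            end--;
        }
        int start = end - EOF_MARKER.length;
        // The marker must not overlap the header.
        return start >= HEADER.length && matchesAt(data, start, EOF_MARKER);
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    private static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }
}
```

Like the original, this is only a signature check: a file can pass it and still be corrupted in the middle, so it is best treated as a cheap pre-filter before handing the data to a real parser.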

Solution 5

I was using some of the suggestions I found here and on other sites/posts for determining whether a pdf was valid or not. I purposely corrupted a pdf file, and unfortunately, many of the solutions did not detect that the file was corrupted.

Eventually, after tinkering around with different methods in the API, I tried this:

PDDocument.load(file).getPage(0).getContents().toString();

This did not throw an exception, but it did output this:

 WARN  [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015

Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but it appeared that the API I was using already handled such errors in its own way.

To get around this, I decided to try parsing the files using the class that logged the warning (COSParser). I found that it has a subclass, PDFParser, which inherits a method called "setLenient", and that turned out to be the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).

I then implemented the following:

        RandomAccessFile accessFile = new RandomAccessFile(file, "r");
        PDFParser parser = new PDFParser(accessFile); 
        parser.setLenient(false);
        parser.parse();

This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!

Author by

Admin

Updated on July 09, 2022

Comments

  • Admin
    Admin almost 2 years

    I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to check if the provided file is indeed a valid PDF?

  • Admin
    Admin about 15 years
    I have tried this, and it appears that PDF Files can use a variety of encodings and the text read sometimes does not match %PDF for valid and readable PDF files.
  • Kyle W. Cartmell
    Kyle W. Cartmell about 15 years
    Not all files that begin with %PDF are valid PDF files.
  • Persimmonium
    Persimmonium about 15 years
    Oh, I forgot to mention there is now an apache project for text extraction, lucene.apache.org/tika, in case you prefer it to aperture
  • Michael Greene
    Michael Greene about 14 years
    Thanks, this just helped me figure out what was going wrong with the PDF I was generating -- an EOL problem only showed in Adobe Reader, not Foxit/GoogleApps/Sumatra.
  • Persimmonium
    Persimmonium over 13 years
    read the question properly: the question was NOT about using PDFBox, but on a way to 'check if the provided file is indeed a valid PDF'
  • cherouvim
    cherouvim over 13 years
    I see "using PdfBox by Apache" in the question's title. If the problem is solvable using PDFBox isn't it better than by introducing extra dependencies?
  • cherouvim
    cherouvim over 13 years
    Is this in Java? Also it'll not detect encrypted PDFs. Since the OP wants to extract info you need that too.
  • MonkeyWrench
    MonkeyWrench about 11 years
    From what I've seen, that's not true. I can use PDDocument.load( stream ) to load a corrupted PDF. I only get an error when attempting to save the PDF after modifying its permissions.
  • Oli Dev
    Oli Dev almost 11 years
    Using Exceptions for application flow is bad practice.
  • cherouvim
    cherouvim almost 11 years
    @BenTurner: You are correct and I am with you on that. The API doesn't give us a way to check for file validity though.
  • sahana
    sahana almost 11 years
    Thanks! I really appreciate that this answer is library agnostic. It saved me a bunch of time =)
  • Aleksei Nikolaevich
    Aleksei Nikolaevich over 10 years
    This does not always throw an exception. stackoverflow.com/questions/20004290/…
  • boumbh
    boumbh almost 9 years
    This answer troubles me... Are there PDFs that do not begin with "%PDF-" but just contain it? Why go to the trouble of reading the whole file? What if I check a 2 GB zip file?
  • shanraisshan
    shanraisshan almost 8 years
    For larger files (10+ MB) with wrong extensions (for example mp3File.pdf), it will take a lot of time (5 seconds or more).
  • Vering
    Vering almost 8 years
    What about PDDocument.load(file).getNumberOfPages() ? This is what I do and I have not yet experienced a non-valid PDF-file where PDFBox could count the number of pages.
  • Mohsen Abasi
    Mohsen Abasi about 7 years
    I have a corrupted PDF. It will be identified as corrupted by iText but not by PDDocument.load(file) of PDFBox!!
  • mkl
    mkl about 7 years
    The %%EOF must be the only content of the last line of the PDF. Thus, files with a space after the %%EOF strictly speaking are invalid. There only may be a line delimiter after it, i.e. a single CR, a single LF, or a CR LF pair.
  • ssimm
    ssimm over 5 years
    Why the downvote? This does answer the question. Maybe this solution is not as robust as the other answers but then the others should be upvoted more, no?
  • Mr. Polywhirl
    Mr. Polywhirl over 5 years
    In version 1.3, the space after EOF does not always appear before the EOL.
  • mkl
    mkl over 5 years
    ISO 32000-2 has been published for quite a while now. So... " I also tailored this to check between PDF versions 1.3 and 1.7" - you should also allow 2.0.
  • Mr. Polywhirl
    Mr. Polywhirl over 5 years
    @mkl I removed the check for version. There may be an issue displaying the version if the format changes from x.y. A safer check would be to look between the percentage signs e.g. %xx.yyy%.
  • 1_bug
    1_bug about 5 years
    By now (almost a decade after the original answer was posted) we have many more PDF versions, so be careful if you intend to just copy and paste the above code!
  • Danielson Alves Júnior
    Danielson Alves Júnior over 4 years
    @1_bug you were foreshadowing! I had a problem with the 1.6 format; for now, I am just checking the "25 50 44 46 2D" group!