How can I determine if a file is a PDF file?
Solution 1
You can detect the MIME type of a file (or byte array), so you don't blindly rely on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/); a few days ago I also saw a library dedicated to this (http://sourceforge.net/projects/mime-util).
I use Aperture to extract text from a variety of file types, not only PDF, but I had to tweak things for PDFs (for example, Aperture uses PDFBox, and I added another library as a fallback for when PDFBox fails).
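As a dependency-free illustration of looking at the content instead of the extension, here is a minimal sketch that compares the first five bytes with the PDF signature %PDF-. The class and method names (PdfSignature, hasPdfSignature) are made up for this example and are not part of Aperture or mime-util. A matching signature only suggests a PDF; it says nothing about whether the rest of the file is intact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not an Aperture or mime-util API.
final class PdfSignature {
    // "%PDF-" as bytes: 25 50 44 46 2D
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the array starts with the PDF signature. */
    static boolean hasPdfSignature(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads at most five bytes from the file, whatever its size. */
    static boolean hasPdfSignature(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < head.length) {
                int read = in.read(head, filled, head.length - filled);
                if (read < 0) {
                    break; // file is shorter than the signature
                }
                filled += read;
            }
        }
        return filled == head.length && hasPdfSignature(head);
    }
}
```

Because only the head of the file is read, a large non-PDF file with a .pdf extension is rejected without loading it into memory.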
Solution 2
Here is what I use in my NUnit tests, which must validate multiple versions of PDF generated by Crystal Reports:
public static void CheckIsPDF(byte[] data)
{
    Assert.IsNotNull(data);
    // bytes 0..7 (header and version) and the last 7 bytes are read below
    Assert.Greater(data.Length, 7);
    // header
    Assert.AreEqual(data[0], 0x25); // %
    Assert.AreEqual(data[1], 0x50); // P
    Assert.AreEqual(data[2], 0x44); // D
    Assert.AreEqual(data[3], 0x46); // F
    Assert.AreEqual(data[4], 0x2D); // -
    if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33) // version is 1.3 ?
    {
        // file terminator
        Assert.AreEqual(data[data.Length - 7], 0x25); // %
        Assert.AreEqual(data[data.Length - 6], 0x25); // %
        Assert.AreEqual(data[data.Length - 5], 0x45); // E
        Assert.AreEqual(data[data.Length - 4], 0x4F); // O
        Assert.AreEqual(data[data.Length - 3], 0x46); // F
        Assert.AreEqual(data[data.Length - 2], 0x20); // SPACE
        Assert.AreEqual(data[data.Length - 1], 0x0A); // EOL
        return;
    }
    if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34) // version is 1.4 ?
    {
        // file terminator
        Assert.AreEqual(data[data.Length - 6], 0x25); // %
        Assert.AreEqual(data[data.Length - 5], 0x25); // %
        Assert.AreEqual(data[data.Length - 4], 0x45); // E
        Assert.AreEqual(data[data.Length - 3], 0x4F); // O
        Assert.AreEqual(data[data.Length - 2], 0x46); // F
        Assert.AreEqual(data[data.Length - 1], 0x0A); // EOL
        return;
    }
    Assert.Fail("Unsupported file format");
}
Solution 3
Since you use PDFBox you can simply do:
PDDocument.load(file);
It'll fail with an Exception if the PDF is corrupted etc.
If it succeeds you can also check if the PDF is encrypted using .isEncrypted()
Solution 4
Here is an adapted Java version of NinjaCross's code.
/**
 * Test if the data in the given byte array represents a PDF file.
 */
public static boolean is_pdf(byte[] data) {
    // length > 7 so that the version bytes data[5..7] and the last 7 bytes can be read safely
    if (data != null && data.length > 7 &&
            data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -
        // version 1.3 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                data[data.length - 7] == 0x25 && // %
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x45 && // E
                data[data.length - 4] == 0x4F && // O
                data[data.length - 3] == 0x46 && // F
                data[data.length - 2] == 0x20 && // SPACE
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }
        // version 1.4 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x25 && // %
                data[data.length - 4] == 0x45 && // E
                data[data.length - 3] == 0x4F && // O
                data[data.length - 2] == 0x46 && // F
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }
    }
    return false;
}
And some simple unit tests:
@Test
public void test_valid_pdf_1_3_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.3 CONTENT %%EOF \n".getBytes()));
}

@Test
public void test_valid_pdf_1_4_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.4 CONTENT %%EOF\n".getBytes()));
}

@Test
public void test_invalid_data_is_not_pdf() {
    assertFalse(is_pdf("Hello World".getBytes()));
}
If you come up with any failing unit tests, please let me know.
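The check above accepts only versions 1.3 and 1.4, each with one exact terminator. A more tolerant variant, sketched below under stated assumptions, ignores the version digits, accepts spaces, CR and LF after %%EOF, and searches only the tail of the array. The class name LenientPdfCheck, its method names, and the 1024-byte tail window are choices made for this illustration and do not come from the code above or from any library. Like the original, it is a quick plausibility test, not proof that the file is a valid PDF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from PDFBox or any other library.
final class LenientPdfCheck {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Assumption: how far from the end of the data to look for %%EOF.
    private static final int TAIL_WINDOW = 1024;

    /** Returns true if data starts with %PDF- and ends with %%EOF plus optional whitespace. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        if (!matchesAt(data, 0, HEADER)) {
            return false;
        }
        int from = Math.max(HEADER.length, data.length - TAIL_WINDOW);
        // Search backwards so that the last %%EOF in the data is the one examined.
        for (int i = data.length - EOF_MARKER.length; i >= from; i--) {
            if (matchesAt(data, i, EOF_MARKER)) {
                return onlyWhitespaceAfter(data, i + EOF_MARKER.length);
            }
        }
        return false;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Deliberately lenient: allows spaces as well as CR and LF after the marker,
    // because the 1.3 files handled above end with "%%EOF" + SPACE + LF.
    private static boolean onlyWhitespaceAfter(byte[] data, int start) {
        for (int i = start; i < data.length; i++) {
            byte b = data[i];
            if (b != 0x20 && b != 0x0D && b != 0x0A) {
                return false;
            }
        }
        return true;
    }
}
```

Both test strings used above ("%PDF-1.3 CONTENT %%EOF \n" and "%PDF-1.4 CONTENT %%EOF\n") pass this check, as does data with a 1.6, 1.7 or 2.0 header, while "Hello World" and data with other content after %%EOF are rejected.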
Solution 5
I tried some of the suggestions I found here and on other sites for determining whether a PDF was valid. I purposely corrupted a PDF file and, unfortunately, many of the solutions did not detect that the file was corrupted.
Eventually, after tinkering around with different methods in the API, I tried this:
PDDocument.load(file).getPage(0).getContents().toString();
This did not throw an exception, but it did output this:
WARN [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015
Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but it appeared that the API I was using already handled such errors in its own way.
To get around this, I decided to try parsing the files using the class that logged the warning (COSParser). I found that it has a subclass, PDFParser, which inherits a method called "setLenient", and that was the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).
I then implemented the following:
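// Note: RandomAccessFile here is presumably org.apache.pdfbox.io.RandomAccessFile (PDFBox 2.0.x), not java.io.RandomAccessFile
RandomAccessFile accessFile = new RandomAccessFile(file, "r");
PDFParser parser = new PDFParser(accessFile);
parser.setLenient(false); // strict mode: fail instead of applying workarounds
parser.parse();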
This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!
Admin
Updated on July 09, 2022
Comments
-
Admin almost 2 years
I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to check if the provided file is indeed a valid PDF?
-
Admin about 15 years
I have tried this, and it appears that PDF files can use a variety of encodings, and the text read sometimes does not match %PDF for valid and readable PDF files.
-
Kyle W. Cartmell about 15 years
Not all files that begin with %PDF are valid PDF files.
-
Persimmonium about 15 years
Oh, I forgot to mention there is now an Apache project for text extraction, lucene.apache.org/tika, in case you prefer it to Aperture.
-
Michael Greene about 14 years
Thanks, this just helped me figure out what was going wrong with the PDF I was generating (an EOL problem that only showed in Adobe Reader, not Foxit/GoogleApps/Sumatra).
-
Persimmonium over 13 years
Read the question properly: the question was NOT about using PDFBox, but about a way to 'check if the provided file is indeed a valid PDF'.
-
cherouvim over 13 years
I see "using PdfBox by Apache" in the question's title. If the problem is solvable using PDFBox, isn't that better than introducing extra dependencies?
-
cherouvim over 13 years
Is this in Java? Also, it will not detect encrypted PDFs. Since the OP wants to extract info, you need that too.
-
MonkeyWrench about 11 years
From what I've seen, that's not true. I can use PDDocument.load( stream ) to load a corrupted PDF. I only get an error when attempting to save the PDF after modifying its permissions.
-
Oli Dev almost 11 years
Using Exceptions for application flow is bad practice.
-
cherouvim almost 11 years
@BenTurner: You are correct and I am with you on that. The API doesn't give us a way to check for file validity though.
-
sahana almost 11 years
Thanks! I really appreciate that this answer is library agnostic. It saved me a bunch of time =)
-
Aleksei Nikolaevich over 10 years
This does not always throw an exception. stackoverflow.com/questions/20004290/…
-
boumbh almost 9 years
This answer troubles me... Are there PDFs that do not begin with "%PDF-" but just contain it? Why go to the trouble of reading the whole file? What if I check a 2 GB zip file?
-
shanraisshan almost 8 years
For larger files (10+ MB) with wrong extensions (for example mp3File.pdf), it will take a lot of time (5 seconds or more).
-
Vering almost 8 years
What about PDDocument.load(file).getNumberOfPages()? This is what I do, and I have not yet come across an invalid PDF file whose pages PDFBox could count.
-
Mohsen Abasi about 7 years
I have a corrupted PDF. It is identified as corrupted by iText but not by PDDocument.load(file) of PDFBox!
-
mkl about 7 yearsThe
%%EOF
must be the only content of the last line of the PDF. Thus, files with a space after the%%EOF
strictly speaking are invalid. There only may be a line delimiter after it, i.e. a single CR, a single LF, or a CR LF pair. -
ssimm over 5 years
Why the downvote? This does answer the question. Maybe this solution is not as robust as the other answers but then the others should be upvoted more, no?
-
Mr. Polywhirl over 5 years
In version 1.3, the space after EOF does not always appear before the EOL.
-
mkl over 5 years
ISO 32000-2 has been published for quite a while now. So... "I also tailored this to check between PDF versions 1.3 and 1.7": you should also allow 2.0.
-
Mr. Polywhirl over 5 years
@mkl I removed the check for version. There may be an issue reading the version if the format changes from x.y. A safer check would be to look between the percentage signs, e.g. %xx.yyy%.
-
1_bug about 5 years
By now (almost a decade after the original answer was posted) we have many more PDF versions, so be careful if you intend to just copy and paste the above code!
-
Danielson Alves Júnior over 4 years
@1_bug you were foreshadowing! I had a problem with the 1.6 format; for now, I am just checking the "25 50 44 46 2D" group!