How can I extract images and their metadata from PDFs?

Solution 1

Images do not contain metadata; they are stored as raw data which needs to be assembled into images. I wrote two blog posts explaining how image data is stored in a PDF file: https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-how-are-images-stored/ and https://blog.idrsolutions.com/2010/09/understanding-the-pdf-file-format-images/
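
As a rough illustration of what "assembling" raw data into an image can involve, here is a minimal sketch using only the JDK. It assumes the stream has already been decoded (for example, inflated from Flate) into 8-bit DeviceRGB samples, three bytes per pixel, row by row. The class and method names are hypothetical, and real PDFs also use other colour spaces, bit depths, and masks that this does not handle.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: turns already-decoded 8-bit RGB samples into a BufferedImage.
final class RawRgbAssembler {

    static BufferedImage assemble(byte[] samples, int width, int height) {
        if (samples.length < width * height * 3) {
            throw new IllegalArgumentException("Not enough sample data for " + width + "x" + height);
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int offset = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Samples are unsigned bytes in R, G, B order.
                int r = samples[offset++] & 0xFF;
                int g = samples[offset++] & 0xFF;
                int b = samples[offset++] & 0xFF;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        // A 2x1 image: one red pixel followed by one blue pixel.
        byte[] samples = {(byte) 0xFF, 0, 0, 0, 0, (byte) 0xFF};
        BufferedImage image = assemble(samples, 2, 1);
        System.out.printf("%06X %06X%n", image.getRGB(0, 0) & 0xFFFFFF, image.getRGB(1, 0) & 0xFFFFFF);
    }
}
```

Note that nothing in this path carries metadata: only width, height, and samples survive, which is the point the answer makes.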

Solution 2

I disagree with the other answers and have a proof of concept for your question. You can extract the XMP metadata of images using PDFBox in the following way:

// Written against the PDFBox 1.x API (getAllPages, PDXObjectImage).
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public void getXMPInformation() {
    // Open the PDF document; give up if it cannot be loaded
    PDDocument document;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }
    try {
        // Get all pages and loop through them
        List<?> pages = document.getDocumentCatalog().getAllPages();
        for (Object p : pages) {
            PDPage page = (PDPage) p;
            PDResources resources = page.getResources();
            Map<?, ?> images = null;
            // Get all images on the page
            try {
                if (resources != null) {
                    images = resources.getImages();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            if (images == null) {
                continue;
            }
            // Check all images for metadata
            for (Map.Entry<?, ?> entry : images.entrySet()) {
                String key = (String) entry.getKey();
                PDXObjectImage image = (PDXObjectImage) entry.getValue();
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: analyzing for metadata");
                if (metadata == null) {
                    System.out.println("No metadata found for this image.");
                } else {
                    try {
                        InputStream xmlInputStream = metadata.createInputStream();
                        System.out.println("--------------------------------------------------------------------------------");
                        System.out.println(convertStreamToString(xmlInputStream));
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the image
                String name = getUniqueFileName(key, image.getSuffix());
                System.out.println("Writing image: " + name);
                try {
                    image.write2file(name);
                } catch (IOException e) {
                    e.printStackTrace();
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
    } finally {
        // Always release the document, even if a page fails
        try {
            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

And the helper methods:

public String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to a String we use BufferedReader.readLine().
     * We iterate until the BufferedReader returns null, which means there is
     * no more data to read. Each line is appended to a StringBuilder and the
     * result is returned as a String.
     */
    if (is != null) {
        StringBuilder sb = new StringBuilder();
        String line;

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } finally {
            is.close();
        }
        return sb.toString();
    } else {       
        return "";
    }
}

// Counts upwards from 0 as images are extracted
private int imageCounter = 0;

private String getUniqueFileName( String prefix, String suffix ) {
    String uniqueName = null;
    File f = null;
    while( f == null || f.exists() ) {
        uniqueName = prefix + "-" + imageCounter;
        f = new File( uniqueName + "." + suffix );
        // Increment inside the loop; otherwise an existing file causes an endless loop
        imageCounter++;
    }
    return uniqueName;
}

Note: This is a quick and dirty proof of concept, not well-styled code.

The images must already carry XMP metadata when they are placed in InDesign, before the PDF document is built. The XMP metadata can be set with Photoshop, for example. Be aware that not all IPTC/Exif/... information is converted into XMP metadata; only a small number of fields are.

I'm using this method on JPG and PNG images placed in PDFs built with InDesign. It works well, and I can retrieve all image information from the finished PDFs after the production steps (picture coating).
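
If the XMP packet printed by the code above contains xmp:CreateDate or xmp:ModifyDate, those values could be stamped onto the exported file. The following is a minimal sketch using only the JDK. The class name, the regex-based lookup, and the assumption that each date is a full ISO 8601 timestamp with a UTC offset are mine and not part of PDFBox. XMP dates may be truncated (year only, or no time zone), in which case this sketch leaves the file time unchanged. A real XML parser would be more robust than a regex.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.FileTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class XmpDateStamper {

    // Finds a property in attribute form (xmp:ModifyDate="...") or
    // element form (<xmp:ModifyDate>...</xmp:ModifyDate>).
    static Optional<FileTime> findDate(String xmp, String property) {
        Pattern p = Pattern.compile(Pattern.quote(property) + "(?:=\"|>)([^\"<]+)");
        Matcher m = p.matcher(xmp);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(FileTime.from(OffsetDateTime.parse(m.group(1).trim()).toInstant()));
        } catch (DateTimeParseException e) {
            return Optional.empty(); // truncated or non-ISO date: skip it
        }
    }

    static void stamp(Path file, String xmp) throws IOException {
        FileTime modified = findDate(xmp, "xmp:ModifyDate").orElse(null);
        FileTime created = findDate(xmp, "xmp:CreateDate").orElse(null);
        // A null argument leaves that timestamp unchanged. Some file systems
        // do not store a creation time and ignore that argument.
        Files.getFileAttributeView(file, BasicFileAttributeView.class)
             .setTimes(modified, null, created);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java XmpDateStamper.java <exported-image> <file-containing-xmp>
        String xmp = new String(Files.readAllBytes(Paths.get(args[1])), "UTF-8");
        stamp(Paths.get(args[0]), xmp);
    }
}
```

These dates are whatever the authoring tool wrote into the XMP packet; they are not guaranteed to match the file system dates of the original image file.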

Solution 3

Short Answer

Maybe, but probably not.

Long Answer

PDF natively supports JPEG, JPEG2000 (which is growing more common), CCITT (fax) Group 3 & 4, and JBIG2 (really rare). Images in these formats can be copied byte-for-byte into the PDF, preserving any metadata WITHIN THE FILE. Creation/change dates are generally part of the file system, not the image.

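JPEG: the file format can carry Exif and XMP in APP1 segments, so a byte-for-byte copy keeps them, provided the PDF producer did not strip or re-encode the stream.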

JPEG2000: Yep. Potentially lots of stuff in there.

CCITT: doesn't look that way.

JBIG2: Err... I think so, but it's none too clear from the specs I've just skimmed.

All other image formats must be turned into pixels and then compressed in some way (often with Flate/ZIP). These conversions could keep the metadata as part of the PDF's XML (XMP) metadata or the image's dictionary, but I've never even heard of that happening. It just gets pitched.
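
For the byte-for-byte JPEG case, one way to check whether anything survived is to scan the undecoded stream for APP1 segments, which is where JPEG files conventionally carry Exif ("Exif\0\0" signature) and XMP ("http://ns.adobe.com/xap/1.0/\0" signature). The sketch below uses only the JDK and assumes you already have the raw DCTDecode stream bytes from your PDF library. The class name is hypothetical, and the scan only looks at segments before the image data starts.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which metadata blocks a JPEG byte stream carries.
final class JpegMetadataScanner {

    static List<String> scan(byte[] jpeg) {
        List<String> found = new ArrayList<>();
        // Every JPEG starts with the SOI marker FF D8.
        if (jpeg.length < 4 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            return found;
        }
        int pos = 2;
        while (pos + 4 <= jpeg.length && (jpeg[pos] & 0xFF) == 0xFF) {
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xDA || marker == 0xD9) {
                break; // start of scan data or end of image: no more header segments
            }
            // Segment length is big-endian and includes its own two bytes.
            int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            if (length < 2) {
                break; // malformed segment
            }
            if (marker == 0xE1) { // APP1
                int start = pos + 4;
                int available = Math.min(length - 2, jpeg.length - start);
                String head = new String(jpeg, start, Math.min(available, 29), StandardCharsets.ISO_8859_1);
                if (head.startsWith("Exif\0\0")) {
                    found.add("Exif");
                } else if (head.startsWith("http://ns.adobe.com/xap/1.0/\0")) {
                    found.add("XMP");
                }
            }
            pos += 2 + length;
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JpegMetadataScanner.java <file-with-raw-jpeg-bytes>
        System.out.println(scan(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

An empty result means the producer stripped or re-encoded the image, which matches the "it just gets pitched" experience described above.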

Solution 4

The original creation and modification dates are generally not saved when the image is embedded into the PDF. Just the raw pixel data is compressed and saved. However, according to Wikipedia:

Raster images in PDF (called Image XObjects) are represented by dictionaries with an associated stream.

The dictionary contains metadata, among which you might find the dates.
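
To show what such a dictionary looks like, here is a small sketch that parses a hand-written example into key/value pairs. The sample dictionary, its values, and the class name are made up for illustration, and the regex only handles flat entries whose values are names, integers, or indirect references. As far as I know the standard image dictionary keys do not include dates; if dates are present, they would more likely sit in the XMP stream that the /Metadata entry points to.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy parser for a flat PDF dictionary; not a general PDF parser.
final class ImageDictionarySketch {

    // A key, then a value that is a name (/DCTDecode), an indirect
    // reference (12 0 R), or an integer.
    private static final Pattern ENTRY =
            Pattern.compile("/(\\w+)\\s+(/\\w+|\\d+\\s+\\d+\\s+R|\\d+)");

    static Map<String, String> parse(String dictionary) {
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(dictionary);
        while (m.find()) {
            entries.put(m.group(1), m.group(2));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up example of an Image XObject dictionary.
        String sample = "<< /Type /XObject /Subtype /Image /Width 640 /Height 480 "
                + "/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /DCTDecode "
                + "/Length 48213 /Metadata 12 0 R >>";
        parse(sample).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```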

Author by

sean

Updated on June 15, 2022

Comments

  • sean
    sean almost 2 years

    Is it possible to use Java to extract images from a PDF file and export them to a specific folder without losing their original creation and modification dates? I tried to achieve this goal by using IText and PDFBox but had no success. Any ideas or examples are welcome.

  • Mark Storer
    Mark Storer about 13 years
    But probably not... see my answer.
  • Erik
    Erik almost 13 years
    I haven't read your blog, but I don't agree. XMP metadata of images is stored within the PDF and can be read back. I'm using this method in a busy production environment and it works perfectly.
  • mark stephens
    mark stephens almost 13 years
    You can store metadata within the PDF, but metadata within the image is lost. Lots of archive systems use metatags within JPEGs, which are lost unless the PDF creation tool specifically includes them.
  • LoMaPh
    LoMaPh over 5 years
    Links are dead.