Extract images from PDF without resampling, in python?

Solution 1

You can use the PyMuPDF module. It writes every image as a .png file, so the stored encoding is not preserved, but it works out of the box and is fast.

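import fitz  # PyMuPDF; install with: pip install PyMuPDF

doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.get_page_images(i):  # named getPageImageList() in old releases
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # GRAY or RGB (alpha channel not counted)
            pix.save("p%s-%s.png" % (i, xref))  # named writePNG() in old releases
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.save("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None  # release the pixmap's memory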

see here for more resources
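
Since the loop above saves everything as PNG, here is a minimal sketch of a variant that keeps each image's stored encoding where PyMuPDF can hand it over unchanged. It assumes a recent PyMuPDF release in which Document.extract_image(xref) returns a dictionary holding the raw bytes under "image" and a suggested file extension under "ext". The names image_filename and extract_native_images are made up for illustration.

```python
from pathlib import Path


def image_filename(page_index, xref, ext):
    """Build an output name such as 'p0-12.jpeg' (hypothetical naming scheme)."""
    return "p%s-%s.%s" % (page_index, xref, ext)


def extract_native_images(pdf_path, out_dir="."):
    """Write each embedded image using the extension PyMuPDF reports for it."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    seen = set()  # an image reused on several pages shares one xref
    doc = fitz.open(pdf_path)
    for page_index in range(len(doc)):
        for img in doc.get_page_images(page_index):
            xref = img[0]
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
            if not info:
                continue  # nothing usable was returned for this xref
            target = out_dir / image_filename(page_index, xref, info["ext"])
            target.write_bytes(info["image"])
            written.append(target)
    doc.close()
    return written
```

Calling extract_native_images("file.pdf", "out") returns the list of paths written. Images whose encoding has no standalone file format may still come back converted, so check the reported extension rather than assuming it matches what was originally inserted.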

Solution 2

In Python, with the PyPDF2 and Pillow libraries, it is simple:

from PIL import Image

from PyPDF2 import PdfReader


def extract_image(pdf_file_path):
    reader = PdfReader(pdf_file_path)
    page = reader.pages[0]  # only the first page is examined
    # PdfReader belongs to the newer PyPDF2 API, which spells these
    # methods get_object()/get_data() instead of getObject()/getData().
    x_object = page["/Resources"]["/XObject"].get_object()

    for obj in x_object:
        if x_object[obj]["/Subtype"] == "/Image":
            size = (x_object[obj]["/Width"], x_object[obj]["/Height"])
            data = x_object[obj].get_data()
            if x_object[obj]["/ColorSpace"] == "/DeviceRGB":
                mode = "RGB"
            else:
                mode = "P"

            if x_object[obj]["/Filter"] == "/FlateDecode":
                # Raw pixel data: rebuild the image and save it losslessly.
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif x_object[obj]["/Filter"] == "/DCTDecode":
                # The stream is already a complete JPEG file.
                with open(obj[1:] + ".jpg", "wb") as img:
                    img.write(data)
            elif x_object[obj]["/Filter"] == "/JPXDecode":
                # The stream is already a complete JPEG 2000 file.
                with open(obj[1:] + ".jp2", "wb") as img:
                    img.write(data)
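
The /Filter entry is not always a single name. Commenters on this answer report PDFs where it is a list such as ['/ASCII85Decode', '/FlateDecode'], and others where the key is missing (KeyError: /Filter). Below is a small sketch of a helper that normalizes those cases before the if/elif chain. The name pick_extension and the mapping table are illustrative assumptions, not part of PyPDF2, and the helper assumes the stream data has already been decoded by the library.

```python
# Extensions for filters whose decoded stream is already a complete image file.
PASSTHROUGH_EXTENSIONS = {
    "/DCTDecode": ".jpg",  # JPEG
    "/JPXDecode": ".jp2",  # JPEG 2000
}


def pick_extension(filter_value):
    """Return a file extension for an image XObject's /Filter value.

    filter_value may be None (no /Filter key), a single name, or a list
    of names applied in sequence. Anything that is not a pass-through
    format falls back to ".png", meaning the pixels have to be rebuilt
    and saved losslessly (for example with Pillow).
    """
    if filter_value is None:
        names = []
    elif isinstance(filter_value, (list, tuple)):
        names = [str(name) for name in filter_value]
    else:
        names = [str(filter_value)]

    for name in names:
        if name in PASSTHROUGH_EXTENSIONS:
            return PASSTHROUGH_EXTENSIONS[name]
    return ".png"
```

With a dictionary-like image object, a call such as pick_extension(x_object[obj].get("/Filter")) avoids the KeyError when the key is absent.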

Solution 3

Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that when extracted is a valid jpg file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.
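
As a rough sketch of the byte-range idea (not the code from the linked post), the function below scans a PDF's raw bytes for the JPEG start-of-image marker FF D8 and the next end-of-image marker FF D9. It assumes the JPEG streams are stored as-is and contain no embedded thumbnail with its own end marker; find_jpegs is a hypothetical name.

```python
JPEG_START = b"\xff\xd8\xff"  # start-of-image marker plus the next marker's 0xFF
JPEG_END = b"\xff\xd9"        # end-of-image marker


def find_jpegs(pdf_bytes):
    """Return a list of byte strings that look like complete JPEG files."""
    images = []
    pos = 0
    while True:
        start = pdf_bytes.find(JPEG_START, pos)
        if start == -1:
            break
        end = pdf_bytes.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break  # truncated stream: no end marker after this start
        end += len(JPEG_END)
        images.append(pdf_bytes[start:end])
        pos = end
    return images
```

Read the file in binary mode, pass its contents to find_jpegs, and write each returned item to its own .jpg file. PDFs whose images use other filters will yield nothing with this approach.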

Solution 4

In Python with PyPDF2, for images stored with the CCITTFaxDecode filter (this code uses the older PdfFileReader API, which PyPDF2 3.0 removed):

import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    # Unsigned fields throughout; the final 'L' is the 4-byte offset of the next IFD.
    tiff_header_struct = '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little-endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, length
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # Offset of next IFD (0 = this is the last IFD)
                       )

pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
    page = cond_scan_reader.getPage(i)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The  CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.

            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                if xObject[obj]['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()
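
One way to sanity-check the TIFF header layout is to pack it and then unpack it again. The sketch below is an independent restatement of the header builder using unsigned fields and a 4-byte next-IFD offset, as a commenter on this answer recommends. The names ccitt_tiff_header and HEADER_FORMAT are illustrative, the tag values mirror the function above, and the width, height and size passed in are arbitrary sample numbers.

```python
import struct

# TIFF header (8 bytes) + entry count (2) + 8 IFD entries (12 each) + next-IFD offset (4)
HEADER_FORMAT = "<2sHLH" + "HHLL" * 8 + "L"


def ccitt_tiff_header(width, height, img_size, ccitt_group=4):
    """Pack a single-strip, 1-bit TIFF header for raw CCITT fax data."""
    return struct.pack(
        HEADER_FORMAT,
        b"II", 42, 8,            # little-endian, magic number, offset of first IFD
        8,                       # number of IFD entries
        256, 4, 1, width,        # ImageWidth
        257, 4, 1, height,       # ImageLength
        258, 3, 1, 1,            # BitsPerSample
        259, 3, 1, ccitt_group,  # Compression (3 = Group 3, 4 = Group 4)
        262, 3, 1, 0,            # PhotometricInterpretation: WhiteIsZero
        273, 4, 1, struct.calcsize(HEADER_FORMAT),  # StripOffsets
        278, 4, 1, height,       # RowsPerStrip
        279, 4, 1, img_size,     # StripByteCounts
        0,                       # no further IFDs
    )


header = ccitt_tiff_header(1728, 2200, 50000)
fields = struct.unpack(HEADER_FORMAT, header)
print(len(header))  # total header size in bytes
print(fields[-1])   # next-IFD offset
```

The StripOffsets value equals the header length, so the image data is expected to start immediately after the header, which is exactly how the script concatenates tiff_header + data.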

Solution 5

Libpoppler comes with a tool called "pdfimages" that does exactly this.

(On ubuntu systems it's in the poppler-utils package)

http://poppler.freedesktop.org/

http://en.wikipedia.org/wiki/Pdfimages

Windows binaries: http://blog.alivate.com.au/poppler-windows/
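
To drive pdfimages from Python, one option is the standard subprocess module. This is a sketch that assumes the pdfimages executable is on PATH and that the installed Poppler is recent enough to support the -all option, which writes images in their stored format where pdfimages can do so. The function names are hypothetical.

```python
import shutil
import subprocess


def pdfimages_command(pdf_path, prefix, first_page=None, last_page=None):
    """Build the argument list for a pdfimages call (does not run it)."""
    cmd = ["pdfimages", "-all"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to scan
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to scan
    cmd += [str(pdf_path), str(prefix)]
    return cmd


def run_pdfimages(pdf_path, prefix):
    """Run pdfimages; output files are named from the prefix plus a counter."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; install poppler-utils")
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
```

For example, run_pdfimages("file.pdf", "out/img") writes the extracted files next to the given prefix; the out directory must already exist.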

Author: matt wilkie

hewer of maps, old time techie, newbie developer personal - www.maphew.com, @maphew in-between - yukongis.ca work - www.env.gov.yk.ca, @mhw-at-yg

Updated on February 15, 2022

Comments

  • matt wilkie
    matt wilkie about 2 years

    How might one extract all images from a pdf document, at native resolution and format? (Meaning extract tiff as tiff, jpeg as jpeg, etc. and without resampling). Layout is unimportant, I don't care where the source image is located on the page.

    I'm using python 2.7 but can use 3.x if required.

  • matt wilkie
    matt wilkie about 14 years
    thanks Ned. It looks like the particular pdf's I need this for are not using jpeg in-situ, but I'll keep your sample around in case it matches up other things that turn up.
  • Filipe Correia
    Filipe Correia almost 12 years
    Image magick uses ghostscript to do this. You can check this post for the ghostscript command that image magick uses under the covers.
  • Raffi
    Raffi over 8 years
    I have to say that sometimes the rendering is really bad. With poppler it works without any issue.
  • matt wilkie
    matt wilkie over 8 years
    Initially excited by this, but it threw up NotImplementedError: unsupported filter /DCTDecode or ... /JPXDecode from xObject[obj].getData() in the first couple pdf's I tested. Details at gist.github.com/maphew/fe6ba4bf9ed2bc98ecf5
  • sylvain
    sylvain over 8 years
    I have recently pushed the '/DCTDecode' modification to PyPDF2 library. You can use my repository: github.com/sylvainpelissier/PyPDF2 while it is integrated in the main branch.
  • matt wilkie
    matt wilkie over 8 years
    Thanks for the update but sorry, still no go. Gist updated. I get ValueError: not enough image data for dctdecode embedded images and unsupported filter /JPXDecode on another pdf.
  • sylvain
    sylvain over 8 years
    I updated it and it should work now. If you can we could add your pdf as an example in the PDF_Samples folder of PyPDF2.
  • matt wilkie
    matt wilkie over 8 years
    making headway! The dctdecode pdf's are processed without error now (though sometimes output images are upside down). However the JPXDecode file now throws KeyError:/Filter instead. I updated the gist accordingly. The PDF files are just random ones from the 'net. The gist has source links.
  • user3599803
    user3599803 over 7 years
    Can you please explain a few things in the code? For example, why would you search for "stream" first and then for startmark? you could just start searching the startmark as this is the start of JPG no? and what's the point of the startfix variable, you dont change it at all..
  • mlissner
    mlissner over 7 years
    "It is simple..."
  • Volatil3
    Volatil3 over 7 years
    @sylvain NotImplementedError: unsupported filter /CCITTFaxDecode I installed from pip INSTALLED: 1.26.0 (latest)
  • sylvain
    sylvain over 7 years
    @Volatil3 yes pip will install it from the official repository, my commit was not yet merged: github.com/mstamy2/PyPDF2/pull/237
  • crld
    crld over 7 years
    This worked immediately for me, and it's extremely fast!! All my images came out inverted, but I was able to fix that with OpenCV. I've been using ImageMagick's convert using subprocess to call it but it is painfully slow. Thanks for sharing this solution
  • user1717828
    user1717828 almost 7 years
    I would love if someone found a Python module that doesn't rely on pdfimages being installed on the subsystem.
  • matt wilkie
    matt wilkie almost 7 years
    That looks interesting. Where did you find it? (And, formatting in your post is a bit messed up. Unbalanced quotes I think.)
  • Max A. H. Hartvigsen
    Max A. H. Hartvigsen almost 7 years
    nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html you can find the original post here...
  • vishvAs vAsuki
    vishvAs vAsuki over 6 years
    A related question here..
  • Petri
    Petri over 6 years
    Finds the images for me, but they are cropped/sized wrong, all b&w and have horizontal lines :(
  • GrantD71
    GrantD71 over 6 years
    This code fails for me on '/ICCBased' '/FlateDecode' filtered images with img = Image.frombytes(mode, size, data) ValueError: not enough image data
  • Labo
    Labo over 6 years
    @GrantD71 I am not an expert, and never heard of ICCBased before. Plus your error is not reproducible if you don't provide the inputs.
  • matt wilkie
    matt wilkie over 6 years
    Thanks Colton. Homebrew is MacOS only. It's good practice to note OS when instructions are platform specific.
  • user1847
    user1847 over 6 years
    @mattwilkie -- Thanks for the heads up. Will note this in my answer.
  • Alok Nayak
    Alok Nayak about 6 years
    it doesn't output images pagewise
  • Basj
    Basj almost 6 years
    I get a KeyError: '/ColorSpace', so I would replace your line with DeviceRGB by if '/ColorSpace' not in xObject[obj] or xObject[obj]['/ColorSpace'] == '/DeviceRGB':. Anyway, this didn't work for me at the end because the images were probably PNG (not sure).
  • Basj
    Basj almost 6 years
    This works great! (pip install pymudf needed first obviously)
  • Labo
    Labo almost 6 years
    @Basj my code is supposed to work with PNG too. What is the value of xObject[obj]['/Filter']?
  • Basj
    Basj almost 6 years
    It is /CCITTFaxDecode. Then this code works. Erratum: I now see my files are a lot of .tiff files but not PNG
  • Labo
    Labo almost 6 years
    Perfect! It seems I had the same problem as the version I use is updated: dropbox.com/s/0w4wlifdu82mmaa/PDF_extract_images.py?dl=0
  • Gerald
    Gerald almost 6 years
    I adapted your code to work on both Python 2 and 3. I also implemented the /Indexed change from Ronan Paixão. I also changed the filter if/elif to be 'in' rather than equals. I had a PDF with the /Filter type ['/ASCII85Decode', '/FlateDecode']. I also changed the function to return image blobs rather than write to file. The updated code can be found here: gist.github.com/gstorer/f6a9f1dfe41e8e64dcf58d07afa9ab2a
  • Labo
    Labo almost 6 years
    @Gerald awesome, thanks! I'll look at the code and update my dropbox :)
  • VSZM
    VSZM over 5 years
    *pip install pymupdf for the fellow googlers who are wondering why the above install fails
  • Damotorie
    Damotorie over 5 years
    Instead of pip install pymupdf trying pip install PyMuPDF more info
  • Darius Mandres
    Darius Mandres over 5 years
    @GrantD71 I have the same error on '/FlateDecode'. I can't make sense of it. Did you ever end up figuring it out? I created a test .pdf with 2 images inside. One .png, one .jpg. The .jpg one extracts just fine but the .pdf one gives this error.
  • Darius Mandres
    Darius Mandres over 5 years
    @Petri Had the same issue. Just use img = Image.frombytes('RGB', size, data). It works for .png/.jpg/.tiff files so far for me. Although, you may run into some problems I haven't fully tested all use cases.
  • Dispenser
    Dispenser about 5 years
    As pointed out elsewhere your tiff_header_struct should read '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'. Note in particular the 'L' at the end.
  • Evan Mata
    Evan Mata about 5 years
    This package is quite helpful (and well documented) and deserves upvotes.
  • crash
    crash about 5 years
    Hi, to solve the NotImplementedError: unsupported filter /CCITTFaxDecode problem the library must be manually installed from the master branch of the github page? Installing it with pip install PyPDF2 won't work?
  • sylvain
    sylvain about 5 years
    Hi, it seems that the most maintained library nowadays is PyPDF4: github.com/claird/PyPDF4
  • Abhimanyu
    Abhimanyu about 5 years
    DCTDecode CCITTFaxDecode filters still not implemented.
  • hru_d
    hru_d about 5 years
    I followed @vishvAsvAsuki link but this packages gives images with white border, so removed it following this stackoverflow question
  • Aakash Basu
    Aakash Basu almost 5 years
    Any help on this please: stackoverflow.com/questions/55899363/…
  • Sha Li
    Sha Li almost 5 years
    Hi there, minecart works perfectly but I got a small problem: sometimes the layout of the images is changed (horizontal -> vertical). Do you have any idea how I could avoid this? Thanks!
  • vault
    vault over 4 years
    With this code I get RuntimeError: pixmap must be grayscale or rgb to write as png, can anyone help?
  • mxl
    mxl over 4 years
    Hello @Modem Rakesh goud, could you please provide the PDF file that triggered this error? Thank you!
  • Modem Rakesh goud
    Modem Rakesh goud over 4 years
    Unfortunately, I can not share that pdf.
  • mxl
    mxl over 4 years
    Or would you eventually be in the possession of a program like Acrobat (not Reader, but the PRO version), or alternatively another PDF editing program which can extract a portion of the PDF and provide only that portion, or, just give me the traceback.print_exc() of the given error line, so that I can see what triggered it; or maybe opt for another of the solutions here on this site, as the one given here (to my understanding) is focused on providing a 1:1 lossless extraction of data from a PDF and may not be what you are looking for, thanks!
  • Oringa
    Oringa about 4 years
    @vault This comment is outdated. You should change "if pix.n < 5" to "if pix.n - pix.alpha < 4" as the original condition does not correctly find CMYK images.
  • matt wilkie
    matt wilkie about 4 years
    This is useful information and it should be documented and shared, as you have just done. +1. However I suggest posting as your own new question and then self-answer because it doesn't address doing this in python, which is point of this Q. (Feel free to cross-link the posts as this is related.)
  • Marco
    Marco about 4 years
    Hi @mattwilkie, thanks for the advice, here is the question: stackoverflow.com/questions/60851124/…
  • matt
    matt about 4 years
    This worked perfectly for the PDF I wanted to extract images from. (In case it helps anyone else, I saved his code as a .py file, then installed/used Python 2.7.18 to run it, passing the path to my PDF as the single command-line argument.)
  • Peter
    Peter about 4 years
    Best option IMO: After installing fitz on Win 10, I got the error: ModuleNotFoundError: No module named 'frontend', which was easily solved by installing pip install PyMuPDF as discussed here: stackoverflow.com/questions/56467667/…
  • Basj
    Basj almost 4 years
    After a few tests on many PDFs, neither @Sylvain's version, this version, nor Gerald's gist version works reliably, sadly. Still, big up for the effort!
  • Karol Zlot
    Karol Zlot almost 4 years
    Better version of this solution (with working /CCITTFaxDecode) can be found in PyPDF4 repository: github.com/claird/PyPDF4/blob/master/scripts/…
  • xax
    xax almost 4 years
    This code worked for me, with almost no modifications. Thank you.
  • MJeremy
    MJeremy over 3 years
    hi, I still don't get how the getData() fixed?
  • havlock
    havlock over 3 years
    This snippet may fail to find what look like images but aren't. The package author has a helpful response to this at github.com/pymupdf/PyMuPDF/issues/469
  • Javi12
    Javi12 over 3 years
    With minecart I get: pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /CCITTFaxDecode
  • Hobbes
    Hobbes about 3 years
    I tested this and it does exactly what I needed, thanks!. One point, filter = raw_image.stream_dict.Filter gives an error because filter is a function. When I change the name, I still get an error, NotImplementedError: don't know how to __str__ this object. I haven't been able to figure out what datatype .filter has.
  • andrewdotn
    andrewdotn about 3 years
    Thanks for the comment. I’ve renamed filter to f to avoid the collision with Python’s built-in filter() function. raw_image.stream_dict.Filter is an instance of pikepdf.objects.Object for me; it seems to have a to_json() method you could try if str() isn’t doing what you want. But the PDF spec also indicates Filter may also be a list which might be part of what you’re seeing? That would be specific to the PDF you’re trying it on. You could try print(type(f)) and print(dir(f)) to see f’s type, attributes, and methods.
  • Matthias Fripp
    Matthias Fripp almost 3 years
    This looks like it is now the easiest and most effective answer. I wish I'd seen it before I tried to implement this using PyPDF! One thing to mention: pikepdf crashed when I tried to export JBIG2 data, so then I installed jbig2dec (conda install jbig2dec) and it worked well. The code above saves image data directly if possible (DCTDecode > jpg, JPXDecode > jp2, CCITTFaxDecode > tif), and otherwise saves in a lossless PNG (JBIG2Decode, FlateDecode). I don't think you can do much better than that.
  • Matthias Fripp
    Matthias Fripp almost 3 years
    This doesn't work with either PyPDF2 or PyPDF4. (1) It doesn't handle the /JBIG2Decode filter. You can partly fix that by saving the raw image data (data = xObject[obj]._data instead of data = xObject[obj].getData()). But jbig2 files are not widely supported, so this is not very useful. (2) The /CCITTFaxDecode filter also crashes in some cases (seems to happen because some PDFs store DecodeParams as an array of dictionaries but PyPDF expects a single dictionary). The PikePDF solution works much better.
  • Matthias Fripp
    Matthias Fripp almost 3 years
    If you want a more "Pythonic" approach, you can also use the PikePDF solution in another answer. If you install jbig2dec (can be done with conda), that will also convert jbig2 images to png automatically.
  • user3072843
    user3072843 almost 3 years
    This will convert the PDF into images, but it does not extract the images from the remaining text.
  • Steve Gon
    Steve Gon over 2 years
    FYI this package is more than 5 years old and has not been updated since 2016.
  • Azhar Uddin Sheikh
    Azhar Uddin Sheikh over 2 years
    display is not defined
  • rmutalik
    rmutalik over 2 years
    @matt wilkie the problem is not with sylvain's answer. If you trace back the code, you will see that the author of PyPDF2 did not implement those 2 filters, as seen in this link on lines 348 and 353: github.com/mstamy2/PyPDF2/blob/master/PyPDF2/filters.py
  • Rufat
    Rufat over 2 years
    For Windows, I compiled the jbig2dec file using Visual Studio and placed it in the Windows directory. The source code is here: jbig2dec.com. In the bat file: call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars32.bat" "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.30.30704\bin\Hostx86\x86\nmake.exe" msvc.mak
  • swestrup
    swestrup over 2 years
    pdfimages often fails for images that are composed of layers, outputting individual layers rather than the image-as-viewed.
  • swestrup
    swestrup over 2 years
    I tried this on a 56-page document full of images, and it only found ONE image on page 53. No idea what the issue is.
  • swestrup
    swestrup over 2 years
    I get AttributeError: module 'pdfminer.pdfparser' has no attribute 'PDFDocument'
  • Shreyansh Dwivedi
    Shreyansh Dwivedi about 2 years
    The PyPDF2 library no longer works on Python versions above 3.6; using it results in dependency errors.
  • sol
    sol about 2 years
    i had to also pip install fitz
  • settwi
    settwi about 2 years
    maybe this is obvious, but you can also import sys and use sys.argv[1] instead of hard-coding a file name if you want to have a drag-and-drop script solution :)