Extract images from PDF without resampling, in python?

python image pdf extract pypdf

150,671

Solution 1

You can use the PyMuPDF module. It writes every image out as a .png file rather than in its original format, but it works out of the box and is fast.

import fitz  # PyMuPDF

doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.get_page_images(i):  # getPageImageList() in older releases
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # GRAY or RGB (alpha channel excluded)
            pix.save("p%s-%s.png" % (i, xref))  # writePNG() in older releases
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.save("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

see here for more resources
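
The snippet above converts everything to PNG. If the goal is to keep each image in the format it is stored in, PyMuPDF's Document.extract_image() may be a closer fit: it returns the image bytes together with a suggested file extension. The following is a minimal sketch, not a tested recipe; the helper names and the output naming scheme are assumptions made for illustration, and images whose stored form is not a standalone file format may still come back converted.

```python
def image_filename(page_number, xref, ext):
    """Build an output name such as 'p0-12.jpeg' (naming scheme is an assumption)."""
    return "p%s-%s.%s" % (page_number, xref, ext)


def extract_native_images(pdf_path):
    """Write each embedded image to disk as PyMuPDF returns it; return the file names."""
    import fitz  # PyMuPDF; imported here so image_filename() works without it

    written = []
    doc = fitz.open(pdf_path)
    for page_number in range(len(doc)):
        for img in doc.get_page_images(page_number):
            xref = img[0]
            info = doc.extract_image(xref)  # dict holding the bytes and an extension
            name = image_filename(page_number, xref, info["ext"])
            with open(name, "wb") as out:
                out.write(info["image"])
            written.append(name)
    doc.close()
    return written
```

Calling extract_native_images("file.pdf") would write the files into the current directory and return their names.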

Solution 2

With the PyPDF2 and Pillow libraries this is fairly simple:

from PIL import Image
from PyPDF2 import PdfReader


def extract_image(pdf_file_path):
    """Save every image XObject found on the first page of the PDF."""
    reader = PdfReader(pdf_file_path)
    page = reader.pages[0]
    x_object = page["/Resources"]["/XObject"].get_object()

    for obj in x_object:
        image = x_object[obj]
        if image["/Subtype"] != "/Image":
            continue
        size = (image["/Width"], image["/Height"])
        data = image.get_data()
        # Only DeviceRGB is mapped explicitly; everything else falls back to "P".
        if image["/ColorSpace"] == "/DeviceRGB":
            mode = "RGB"
        else:
            mode = "P"

        if image["/Filter"] == "/FlateDecode":
            # get_data() returns the decompressed pixels, so rebuild them as PNG.
            img = Image.frombytes(mode, size, data)
            img.save(obj[1:] + ".png")
        elif image["/Filter"] == "/DCTDecode":
            # The stream is already a complete JPEG file.
            with open(obj[1:] + ".jpg", "wb") as img:
                img.write(data)
        elif image["/Filter"] == "/JPXDecode":
            # The stream is already a complete JPEG 2000 file.
            with open(obj[1:] + ".jp2", "wb") as img:
                img.write(data)

Solution 3

Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that when extracted is a valid jpg file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.
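
As a rough illustration of the byte-range idea (this is a sketch, not the code from the linked post), the function below walks the stream ... endstream sections of a PDF and keeps those whose content starts with the JPEG start-of-image marker. The function name is made up for this example, and the approach assumes the JPEG data is stored unencrypted and without an additional filter wrapped around it.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def find_embedded_jpegs(pdf_bytes):
    """Return the byte strings of streams in pdf_bytes that look like JPEG files."""
    jpegs = []
    pos = 0
    while True:
        start = pdf_bytes.find(b"stream", pos)
        if start < 0:
            break
        end = pdf_bytes.find(b"endstream", start)
        if end < 0:
            break
        # Drop the line break that follows the "stream" keyword.
        body = pdf_bytes[start + len(b"stream"):end].lstrip(b"\r\n")
        if body.startswith(SOI):
            last = body.rfind(EOI)  # last end-of-image marker inside this stream
            if last >= 0:
                jpegs.append(body[:last + len(EOI)])
        pos = end + len(b"endstream")
    return jpegs
```

To use it, read the whole PDF in binary mode, pass the bytes to find_embedded_jpegs(), and write each returned item to its own .jpg file.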

Solution 4

With PyPDF2, for images that use the CCITTFaxDecode filter:

import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    tiff_header_struct = '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little-endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, length
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, length of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # Offset of next IFD (0 = this is the last one)
                       )
                       )

pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfReader(pdf_file)  # PdfFileReader in older PyPDF2 releases
for page in cond_scan_reader.pages:
    xObject = page['/Resources']['/XObject'].get_object()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The  CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.

            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                if xObject[obj]['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()
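
A quick way to sanity-check a hand-built TIFF header like this one is to compare the packed size with what the TIFF layout implies. The sketch below assumes a little-endian file with a single IFD of eight entries; the names are illustrative only. The format string ends in a 4-byte 'L' because the offset to the next IFD is a 32-bit field.

```python
import struct

# Assumed layout: 2-byte byte order, 2-byte version, 4-byte offset to the first IFD,
# 2-byte entry count, eight 12-byte entries, 4-byte offset to the next IFD.
TIFF_HEADER_FORMAT = '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'


def expected_header_size(num_tags):
    """Size in bytes of the file header plus one IFD with num_tags entries."""
    file_header = 8    # byte order + version + offset to first IFD
    entry_count = 2    # number of IFD entries
    next_ifd = 4       # offset of the next IFD
    return file_header + entry_count + 12 * num_tags + next_ifd


packed_size = struct.calcsize(TIFF_HEADER_FORMAT)
```

If packed_size and expected_header_size(8) disagree, the StripOffsets value written into the header will point at the wrong byte and readers may reject the file.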

Solution 5

Libpoppler comes with a tool called "pdfimages" that does exactly this.

(On Ubuntu systems it's in the poppler-utils package.)

http://poppler.freedesktop.org/

http://en.wikipedia.org/wiki/Pdfimages

Windows binaries: http://blog.alivate.com.au/poppler-windows/
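
Since the question asks about Python, one option is to call pdfimages through subprocess. This is a minimal sketch under a few assumptions: pdfimages is on the PATH, the installed Poppler version supports the -all option (which writes images in their stored format where it can), and the helper names are invented for this example.

```python
import shutil
import subprocess


def build_pdfimages_command(pdf_path, output_prefix):
    """Return the argument list for pdfimages; -all asks for native formats."""
    return ["pdfimages", "-all", pdf_path, output_prefix]


def extract_with_pdfimages(pdf_path, output_prefix):
    """Run pdfimages, raising if the tool is missing or exits with an error."""
    if shutil.which("pdfimages") is None:
        raise FileNotFoundError("pdfimages not found on PATH (install poppler-utils)")
    subprocess.run(build_pdfimages_command(pdf_path, output_prefix), check=True)
```

For example, extract_with_pdfimages("scan.pdf", "img") would leave files whose names start with img in the current directory.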


matt wilkie

hewer of maps, old time techie, newbie developer personal - www.maphew.com, @maphew in-between - yukongis.ca work - www.env.gov.yk.ca, @mhw-at-yg

Updated on February 15, 2022

Comments

matt wilkie about 2 years

How might one extract all images from a pdf document, at native resolution and format? (Meaning extract tiff as tiff, jpeg as jpeg, etc. and without resampling). Layout is unimportant, I don't care where the source image is located on the page.

I'm using python 2.7 but can use 3.x if required.
- nealmcb over 12 years
  
  Thanks. That "how images are stored in PDF" url didn't work, but this seems to: jpedal.org/PDFblog/2010/04/…
- matt wilkie over 8 years
  
  There is a JPedal java library which does this called PDF Clipped Image Extraction. The author, Mark Stephens, has a concise highlevel overview of how images are stored in PDF which may help someone building a python extractor.
- Gruber almost 3 years
  
  Link above from @nealmcb moved to blog.idrsolutions.com/2010/04/…
matt wilkie about 14 years

thanks Ned. It looks like the particular pdf's I need this for are not using jpeg in-situ, but I'll keep your sample around in case it matches up other things that turn up.
Filipe Correia almost 12 years

Image magick uses ghostscript to do this. You can check this post for the ghostscript command that image magick uses under the covers.
Raffi over 8 years

I have to say that sometimes the rendering is really bad. With poppler it works without any issue.
matt wilkie over 8 years

Initially excited by this, but it threw up NotImplementedError: unsupported filter /DCTDecode or ... /JPXDecode from xObject[obj].getData() in the first couple pdf's I tested. Details at gist.github.com/maphew/fe6ba4bf9ed2bc98ecf5
sylvain over 8 years

I have recently pushed the '/DCTDecode' modification to PyPDF2 library. You can use my repository: github.com/sylvainpelissier/PyPDF2 while it is integrated in the main branch.
matt wilkie over 8 years

Thanks for the update but sorry, still no go. Gist updated. I get ValueError: not enough image data for dctdecode embedded images and unsupported filter /JPXDecode on another pdf.
sylvain over 8 years

I updated it and it should work now. If you can we could add your pdf as an example in the PDF_Samples folder of PyPDF2.
matt wilkie over 8 years

making headway! The dctdecode pdf's are processed without error now (though sometimes output images are upside down). However the JPXDecode file now throws KeyError:/Filter instead. I updated the gist accordingly. The PDF files are just random ones from the 'net. The gist has source links.
user3599803 over 7 years

Can you please explain a few things in the code? For example, why would you search for "stream" first and then for startmark? you could just start searching the startmark as this is the start of JPG no? and what's the point of the startfix variable, you dont change it at all..
mlissner over 7 years

"It is simple..."
Volatil3 over 7 years

@sylvain NotImplementedError: unsupported filter /CCITTFaxDecode I installed from pip INSTALLED: 1.26.0 (latest)
sylvain over 7 years

@Volatil3 yes pip will install it from the official repository, my commit was not yet merged: github.com/mstamy2/PyPDF2/pull/237
crld over 7 years

This worked immediately for me, and it's extremely fast!! All my images came out inverted, but I was able to fix that with OpenCV. I've been using ImageMagick's convert using subprocess to call it but it is painfully slow. Thanks for sharing this solution
user1717828 almost 7 years

I would love if someone found a Python module that doesn't rely on pdfimages being installed on the subsystem.
matt wilkie almost 7 years

That looks interesting. Where did you find it? (And, formatting in your post is a bit messed up. Unbalanced quotes I think.)
Max A. H. Hartvigsen almost 7 years

nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html you can find the original post here...
vishvAs vAsuki over 6 years

A related question here..
Petri over 6 years

Finds the images for me, but they are cropped/sized wrong, all b&w and have horizontal lines :(
GrantD71 over 6 years

This code fails for me on '/ICCBased' '/FlateDecode' filtered images with img = Image.frombytes(mode, size, data) ValueError: not enough image data
Labo over 6 years

@GrantD71 I am not an expert, and never heard of ICCBased before. Plus your error is not reproducible if you don't provide the inputs.
matt wilkie over 6 years

Thanks Colton. Homebrew is MacOS only. It's good practice to note OS when instructions are platform specific.
user1847 over 6 years

@mattwilkie -- Thanks for the heads up. Will note this in my answer.
Alok Nayak about 6 years

it doesn't output images pagewise
Basj almost 6 years

I get a KeyError: '/ColorSpace', so I would replace your line with DeviceRGB by if '/ColorSpace' not in xObject[obj] or xObject[obj]['/ColorSpace'] == '/DeviceRGB':. Anyway, this didn't work for me at the end because the images were probably PNG (not sure).
Basj almost 6 years

This works great! (pip install pymudf needed first obviously)
Labo almost 6 years

@Basj my code is supposed to work with PNG too. What is the value of xObject[obj]['/Filter']?
Basj almost 6 years

It is /CCITTFaxDecode. Then this code works. Erratum: I now see my files are a lot of .tiff files but not PNG
Labo almost 6 years

Perfect! It seems I had the same problem as the version I use is updated: dropbox.com/s/0w4wlifdu82mmaa/PDF_extract_images.py?dl=0
Gerald almost 6 years

I adapted your code to work on both Python 2 and 3. I also implemented the /Indexed change from Ronan Paixão. I also changed the filter if/elif to be 'in' rather than equals. I had a PDF with the /Filter type ['/ASCII85Decode', '/FlateDecode']. I also changed the function to return image blobs rather than write to file. The updated code can be found here: gist.github.com/gstorer/f6a9f1dfe41e8e64dcf58d07afa9ab2a
Labo almost 6 years

@Gerald awesome, thanks! I'll look at the code and update my dropbox :)
VSZM over 5 years

*pip install pymupdf for the fellow googlers who are wondering why the above install fails
Damotorie over 5 years

Instead of pip install pymupdf trying pip install PyMuPDF more info
Darius Mandres over 5 years

@GrantD71 I have the same error on '/FlateDecode'. I can't make sense of it. Did you ever end up figuring it out? I created a test .pdf with 2 images inside. One .png, one .jpg. The .jpg one extracts just fine but the .pdf one gives this error.
Darius Mandres over 5 years

@Petri Had the same issue. Just use img = Image.frombytes('RGB', size, data). It works for .png/.jpg/.tiff files so far for me. Although, you may run into some problems I haven't fully tested all use cases.
Dispenser about 5 years

As pointed out elsewhere your tiff_header_struct should read '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'. Note in particular the 'L' at the end.
Evan Mata about 5 years

This package is quite helpful (and well documented) and deserves upvotes.
crash about 5 years

Hi, to solve the NotImplementedError: unsupported filter /CCITTFaxDecode problem the library must be manually installed from the master branch of the github page? Installing it with pip install PyPDF2 won't work?
sylvain about 5 years

Hi, it seems that the most maintained library nowadays is PyPDF4: github.com/claird/PyPDF4
Abhimanyu about 5 years

DCTDecode CCITTFaxDecode filters still not implemented.
hru_d about 5 years

I followed @vishvAsvAsuki's link, but this package gives images with a white border, so I removed it following this stackoverflow question
Aakash Basu almost 5 years

Any help on this please: stackoverflow.com/questions/55899363/…
Sha Li almost 5 years

Hi there, minecart works perfectly but I got a small problem: sometimes the layout of the images is changed (horizontal -> vertical). Do you have any idea how I could avoid this? Thanks!
vault over 4 years

With this code I get RuntimeError: pixmap must be grayscale or rgb to write as png, can anyone help?
mxl over 4 years

Hello @Modem Rakesh goud, could you please provide the PDF file that triggered this error? Thank you!
Modem Rakesh goud over 4 years

Unfortunately, I can not share that pdf.
mxl over 4 years

Or would you eventually be in the possession of a program like Acrobat (not Reader, but the PRO version), or alternatively another PDF editing program which can extract a portion of the PDF and provide only that portion, or, just give me the traceback.print_exc() of the given error line, so that I can see what triggered it; or maybe opt for another of the solutions here on this site, as the one given here (to my understanding) is focused on providing a 1:1 lossless extraction of data from a PDF and may not be what you are looking for, thanks!
Oringa about 4 years

@vault This comment is outdated. You should change "if pix.n < 5" to "if pix.n - pix.alpha < 4" as the original condition does not correctly find CMYK images.
matt wilkie about 4 years

This is useful information and it should be documented and shared, as you have just done. +1. However I suggest posting as your own new question and then self-answer because it doesn't address doing this in python, which is point of this Q. (Feel free to cross-link the posts as this is related.)
Marco about 4 years

Hi @mattwilkie, thanks for the advice, here is the question: stackoverflow.com/questions/60851124/…
matt about 4 years

This worked perfectly for the PDF I wanted to extract images from. (In case it helps anyone else, I saved his code as a .py file, then installed/used Python 2.7.18 to run it, passing the path to my PDF as the single command-line argument.)
Peter about 4 years

Best option IMO: After installing fitz on Win 10, I got the error: ModuleNotFoundError: No module named 'frontend', which was easily solved by installing pip install PyMuPDF as discussed here: stackoverflow.com/questions/56467667/…
Basj almost 4 years

After a few tests on many PDFs, neither @Sylvain's version, this version, nor Gerald's gist version works reliably, sadly. Still, big up for the effort!
Karol Zlot almost 4 years

Better version of this solution (with working /CCITTFaxDecode) can be found in PyPDF4 repository: github.com/claird/PyPDF4/blob/master/scripts/…
xax almost 4 years

This code worked for me, with almost no modifications. Thank you.
MJeremy over 3 years

hi, I still don't get how the getData() fixed?
havlock over 3 years

This snippet may fail to find what look like images but aren't. The package author has a helpful response to this at github.com/pymupdf/PyMuPDF/issues/469
Javi12 over 3 years

With minecart I get: pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /CCITTFaxDecode
Hobbes about 3 years

I tested this and it does exactly what I needed, thanks!. One point, filter = raw_image.stream_dict.Filter gives an error because filter is a function. When I change the name, I still get an error, NotImplementedError: don't know how to __str__ this object. I haven't been able to figure out what datatype .filter has.
andrewdotn about 3 years

Thanks for the comment. I’ve renamed filter to f to avoid the collision with Python’s built-in filter() function. raw_image.stream_dict.Filter is an instance of pikepdf.objects.Object for me; it seems to have a to_json() method you could try if str() isn’t doing what you want. But the PDF spec also indicates Filter may also be a list which might be part of what you’re seeing? That would be specific to the PDF you’re trying it on. You could try print(type(f)) and print(dir(f)) to see f’s type, attributes, and methods.
Matthias Fripp almost 3 years

This looks like it is now the easiest and most effective answer. I wish I'd seen it before I tried to implement this using PyPDF! One thing to mention: pikepdf crashed when I tried to export JBIG2 data, so then I installed jbig2dec (conda install jbig2dec) and it worked well. The code above saves image data directly if possible (DCTDecode > jpg, JPXDecode > jp2, CCITTFaxDecode > tif), and otherwise saves in a lossless PNG (JBIG2Decode, FlateDecode). I don't think you can do much better than that.
Matthias Fripp almost 3 years

This doesn't work with either PyPDF2 or PyPDF4. (1) It doesn't handle the /JBIG2Decode filter. You can partly fix that by saving the raw image data (data = xObject[obj]._data instead of data = xObject[obj].getData()). But jbig2 files are not widely supported, so this is not very useful. (2) The /CCITTFaxDecode filter also crashes in some cases (seems to happen because some PDFs store DecodeParams as an array of dictionaries but PyPDF expects a single dictionary). The PikePDF solution works much better.
Matthias Fripp almost 3 years

If you want a more "Pythonic" approach, you can also use the PikePDF solution in another answer. If you install jbig2dec (can be done with conda), that will also convert jbig2 images to png automatically.
user3072843 almost 3 years

This will convert the PDF into images, but it does not extract the images from the remaining text.
Steve Gon over 2 years

FYI this package is more than 5 years old and has not been updated since 2016.
Azhar Uddin Sheikh over 2 years

display is not defined
rmutalik over 2 years

@matt wilkie the problem is not with sylvain's answer. If you trace back the code, you will see that the author of PyPDF2 did not implement those 2 filters, as seen in this link on lines 348 and 353: github.com/mstamy2/PyPDF2/blob/master/PyPDF2/filters.py
Rufat over 2 years

For Windows, I compiled the jbig2dec file using Visual Studio and placed it in the Windows directory. The source code is here: jbig2dec.com. In the bat file: call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars32.bat" "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.30.30704\bin\Hostx86\x86\nmake.exe" msvc.mak
swestrup over 2 years

pdfimages often fails for images that are composed of layers, outputting individual layers rather than the image-as-viewed.
swestrup over 2 years

I tried this on a 56-page document full of images, and it only found ONE image on page 53. No idea what the issue is.
swestrup over 2 years

I get AttributeError: module 'pdfminer.pdfparser' has no attribute 'PDFDocument'
Shreyansh Dwivedi about 2 years

The PyPDF2 library does not work any more on Python versions above 3.6; using it results in dependency errors
sol about 2 years

i had to also pip install fitz
settwi about 2 years

maybe this is obvious, but you can also import sys and use sys.argv[1] instead of hard-coding a file name if you want to have a drag-and-drop script solution :)