Extract images from PDF without resampling, in python?
Solution 1
You can use the PyMuPDF module. It writes every image as a .png file (so non-PNG images are re-encoded), but it works out of the box and is fast.
import fitz  # PyMuPDF; install with "pip install pymupdf"
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    # older PyMuPDF releases call this getPageImageList()
    for img in doc.get_page_images(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # this is GRAY or RGB
            pix.save("p%s-%s.png" % (i, xref))  # writePNG() in older releases
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.save("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
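The snippet above re-encodes every image as PNG. When the goal is the embedded data in its stored format, PyMuPDF also offers Document.extract_image(xref), which returns a dictionary holding the image bytes ("image") and a suggested extension ("ext"). The following is a minimal sketch under that assumption; the function names, the output naming scheme and the lazy import are illustrative choices, not part of the original answer, and formats that PyMuPDF cannot hand back unchanged may still come out as PNG.

```python
import os


def image_filename(page_index, xref, ext):
    """Build an output name such as 'p0-12.jpeg' (hypothetical naming scheme)."""
    return "p%s-%s.%s" % (page_index, xref, ext)


def extract_native_images(pdf_path, out_dir="."):
    """Write each embedded image once, using the extension PyMuPDF reports."""
    import fitz  # PyMuPDF; imported here so image_filename() works without it

    doc = fitz.open(pdf_path)
    seen = set()      # an image shared by several pages has a single xref
    written = []
    for i in range(len(doc)):
        for img in doc.get_page_images(i):
            xref = img[0]
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
            path = os.path.join(out_dir, image_filename(i, xref, info["ext"]))
            with open(path, "wb") as f:
                f.write(info["image"])
            written.append(path)
    doc.close()
    return written
```

Calling extract_native_images("file.pdf") would write the files to the current directory and return their paths.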
Solution 2
In Python with PyPDF2 and Pillow libraries it is simple:
from PIL import Image
from PyPDF2 import PdfReader
def extract_image(pdf_file_path):
    reader = PdfReader(pdf_file_path)
    page = reader.pages[0]  # only the first page is examined
    x_object = page["/Resources"]["/XObject"].get_object()
    for obj in x_object:
        image = x_object[obj]
        if image["/Subtype"] != "/Image":
            continue
        size = (image["/Width"], image["/Height"])
        data = image.get_data()
        # /ColorSpace can be missing; treat that case as RGB, as suggested in the comments
        if "/ColorSpace" not in image or image["/ColorSpace"] == "/DeviceRGB":
            mode = "RGB"
        else:
            mode = "P"
        # /Filter can be missing as well, so do not index it blindly
        image_filter = image["/Filter"] if "/Filter" in image else None
        name = obj[1:]  # strip the leading "/" from the XObject name
        if image_filter == "/FlateDecode":
            img = Image.frombytes(mode, size, data)
            img.save(name + ".png")
        elif image_filter == "/DCTDecode":
            with open(name + ".jpg", "wb") as img:
                img.write(data)
        elif image_filter == "/JPXDecode":
            with open(name + ".jp2", "wb") as img:
                img.write(data)
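Several comments further down report that the /Filter entry is sometimes a list (for example ['/ASCII85Decode', '/FlateDecode']) or absent altogether, which breaks a plain equality test. The following is a small sketch of one way to normalise it before choosing an output extension. The helper names and the extension table are hypothetical, and it assumes the filter value behaves like a string or a list of strings, as PyPDF2's name and array objects do.

```python
# Hypothetical mapping from the last PDF filter to a file extension.
EXTENSIONS = {
    "/DCTDecode": ".jpg",        # JPEG data, can be written as-is
    "/JPXDecode": ".jp2",        # JPEG 2000 data, can be written as-is
    "/CCITTFaxDecode": ".tiff",  # needs a TIFF header in front of the data
    "/FlateDecode": ".png",      # raw pixels, must be re-encoded (e.g. with Pillow)
}


def as_filter_list(filter_entry):
    """Return the /Filter entry as a list of names, whatever its shape."""
    if filter_entry is None:
        return []
    if isinstance(filter_entry, (list, tuple)):
        return [str(name) for name in filter_entry]
    return [str(filter_entry)]


def guess_extension(filter_entry):
    """Pick an extension from the last filter in the chain, or None if unknown.

    Filters are listed in decoding order, so the last one describes the
    data that remains once the earlier filters have been undone.
    """
    filters = as_filter_list(filter_entry)
    if not filters:
        return None
    return EXTENSIONS.get(filters[-1])
```

For example, guess_extension(["/ASCII85Decode", "/FlateDecode"]) gives ".png", while an unknown filter such as "/JBIG2Decode" gives None so the caller can decide what to do.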
Solution 3
Often in a PDF, the image is simply stored as-is. For example, a PDF with a JPEG inserted will have a range of bytes somewhere in the middle that, when extracted, is a valid JPEG file. You can use this to extract the images very simply, by copying those byte ranges out of the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs (nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html).
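The idea can be sketched in a few lines. This is not the code from the linked post, only an illustration of the same byte-scanning approach: it assumes the JPEG sits unmodified inside a stream ... endstream pair and begins with the usual FF D8 marker within a few bytes of the stream keyword. The function name and the 20-byte search window are arbitrary choices.

```python
def find_embedded_jpegs(pdf_bytes):
    """Return byte strings that look like JPEG files stored inside PDF streams."""
    start_mark = b"\xff\xd8"  # JPEG start-of-image marker
    end_mark = b"\xff\xd9"    # JPEG end-of-image marker
    jpegs = []
    pos = 0
    while True:
        stream_start = pdf_bytes.find(b"stream", pos)
        if stream_start < 0:
            break
        stream_end = pdf_bytes.find(b"endstream", stream_start)
        if stream_end < 0:
            break
        # A JPEG stored as-is starts right after the "stream" keyword and newline.
        jpg_start = pdf_bytes.find(start_mark, stream_start, stream_start + 20)
        if jpg_start >= 0:
            # Take the last end marker that occurs before "endstream".
            jpg_end = pdf_bytes.rfind(end_mark, jpg_start, stream_end)
            if jpg_end >= 0:
                jpegs.append(pdf_bytes[jpg_start:jpg_end + 2])
        pos = stream_end + len(b"endstream")
    return jpegs
```

Reading a file with open("file.pdf", "rb").read() and writing each returned item to its own .jpg file completes the extraction. Images stored with any other filter are simply not found by this approach.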
Solution 4
In Python, with PyPDF2, for the CCITTFaxDecode filter:
import PyPDF2
import struct
"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""
def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    # Unsigned codes throughout; the trailing next-IFD offset is a 4-byte LONG
    tiff_header_struct = '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little-endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, length
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # last IFD
                       )
pdf_filename = 'scan.pdf'
with open(pdf_filename, 'rb') as pdf_file:
    cond_scan_reader = PyPDF2.PdfReader(pdf_file)  # older PyPDF2 releases call this PdfFileReader
    for page in cond_scan_reader.pages:
        xObject = page['/Resources']['/XObject'].get_object()
        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                # The CCITTFaxDecode filter decodes image data that has been encoded using
                # either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
                # designed to achieve efficient compression of monochrome (1 bit per pixel) image
                # data at relatively low resolutions, and so is useful only for bitmap image data, not
                # for color images, grayscale images, or general data.
                # K < 0 --- Pure two-dimensional encoding (Group 4)
                # K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
                # K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
                if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                    if xObject[obj]['/DecodeParms']['/K'] < 0:
                        CCITT_group = 4
                    else:
                        CCITT_group = 3
                    width = xObject[obj]['/Width']
                    height = xObject[obj]['/Height']
                    data = xObject[obj]._data  # sorry, get_data() does not work for CCITTFaxDecode
                    img_size = len(data)
                    tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                    img_name = obj[1:] + '.tiff'
                    with open(img_name, 'wb') as img_file:
                        img_file.write(tiff_header + data)
                    #
                    # To load the result with Pillow instead of writing a file:
                    # import io
                    # from PIL import Image
                    # im = Image.open(io.BytesIO(tiff_header + data))
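The header size matters because StripOffsets (tag 273) must point at the first byte of image data: 8 bytes of file header, a 2-byte entry count, eight 12-byte IFD entries and a 4-byte next-IFD offset add up to 110 bytes. The sketch below is a quick way to sanity-check such a header. It assumes the unsigned format codes recommended in the comments further down, and describe_tiff_header is a hypothetical helper, not part of PyPDF2 or Pillow.

```python
import struct

# Little-endian TIFF header followed by a single IFD with 8 entries.
HEADER_FORMAT = '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'


def describe_tiff_header(header_bytes):
    """Unpack a header built with HEADER_FORMAT into a readable dictionary."""
    fields = struct.unpack(HEADER_FORMAT, header_bytes)
    byte_order, magic, ifd_offset, entry_count = fields[:4]
    entries = {}
    for i in range(8):  # the format above always holds exactly 8 entries
        tag, _field_type, _count, value = fields[4 + 4 * i: 8 + 4 * i]
        entries[tag] = value
    return {
        "byte_order": byte_order,
        "magic": magic,
        "ifd_offset": ifd_offset,
        "entry_count": entry_count,
        "entries": entries,      # tag number -> value
        "next_ifd": fields[-1],  # 0 means there is no further IFD
    }


# 8 (file header) + 2 (entry count) + 8 * 12 (entries) + 4 (next IFD offset)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
```

If describe_tiff_header(header)["entries"][273] differs from HEADER_SIZE, the strip offset does not point at the image data and most TIFF readers will reject or misread the file.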
Solution 5
Libpoppler comes with a tool called "pdfimages" that does exactly this.
(On Ubuntu systems it's in the poppler-utils package.)
http://poppler.freedesktop.org/
http://en.wikipedia.org/wiki/Pdfimages
Windows binaries: http://blog.alivate.com.au/poppler-windows/
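Since the question asks for Python, pdfimages can be driven through subprocess. A minimal sketch, assuming the pdfimages binary is on the PATH and is a poppler build that supports the -all option (write each image in its native format where possible); the function names are made up for illustration.

```python
import shutil
import subprocess


def build_pdfimages_command(pdf_path, output_prefix):
    """Return the argument list for: pdfimages -all <pdf> <prefix>."""
    return ["pdfimages", "-all", pdf_path, output_prefix]


def extract_with_pdfimages(pdf_path, output_prefix):
    """Run pdfimages; output files are named <prefix>-NNN.<ext>."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH (install poppler-utils)")
    # check=True raises CalledProcessError if pdfimages exits with an error
    subprocess.run(build_pdfimages_command(pdf_path, output_prefix), check=True)
```

For example, extract_with_pdfimages("file.pdf", "img") would leave the extracted files, with names starting with img-, in the current directory.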
matt wilkie
Updated on February 15, 2022
Comments
-
matt wilkie about 2 years
How might one extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., without resampling.) Layout is unimportant; I don't care where the source image is located on the page.
I'm using Python 2.7 but can use 3.x if required.
-
nealmcb over 12 years
Thanks. That "how images are stored in PDF" url didn't work, but this seems to: jpedal.org/PDFblog/2010/04/…
-
matt wilkie over 8 years
There is a JPedal Java library which does this called PDF Clipped Image Extraction. The author, Mark Stephens, has a concise high-level overview of how images are stored in PDF which may help someone building a Python extractor.
-
Gruber almost 3 years
Link above from @nealmcb moved to blog.idrsolutions.com/2010/04/…
-
matt wilkie about 14 years
Thanks Ned. It looks like the particular PDFs I need this for are not using JPEG in situ, but I'll keep your sample around in case it matches up with other things that turn up.
-
Filipe Correia almost 12 years
ImageMagick uses Ghostscript to do this. You can check this post for the Ghostscript command that ImageMagick uses under the covers.
-
Raffi over 8 years
I have to say that sometimes the rendering is really bad. With poppler it works without any issue.
-
matt wilkie over 8 years
Initially excited by this, but it threw up NotImplementedError: unsupported filter /DCTDecode or... /JPXDecode from xObject[obj].getData() in the first couple of PDFs I tested. Details at gist.github.com/maphew/fe6ba4bf9ed2bc98ecf5
-
sylvain over 8 years
I have recently pushed the '/DCTDecode' modification to the PyPDF2 library. You can use my repository: github.com/sylvainpelissier/PyPDF2 until it is integrated in the main branch.
-
matt wilkie over 8 years
Thanks for the update but sorry, still no go. Gist updated. I get ValueError: not enough image data for dctdecode embedded images and unsupported filter /JPXDecode on another pdf.
-
sylvain over 8 years
I updated it and it should work now. If you can we could add your pdf as an example in the PDF_Samples folder of PyPDF2.
-
matt wilkie over 8 years
Making headway! The dctdecode PDFs are processed without error now (though sometimes output images are upside down). However the JPXDecode file now throws KeyError:/Filter instead. I updated the gist accordingly. The PDF files are just random ones from the 'net. The gist has source links.
-
user3599803 over 7 years
Can you please explain a few things in the code? For example, why would you search for "stream" first and then for startmark? You could just start searching for the startmark, as this is the start of the JPG, no? And what's the point of the startfix variable? You don't change it at all.
-
mlissner over 7 years
"It is simple..."
-
Volatil3 over 7 years
@sylvain NotImplementedError: unsupported filter /CCITTFaxDecode. I installed from pip. INSTALLED: 1.26.0 (latest)
-
sylvain over 7 years
@Volatil3 yes pip will install it from the official repository, my commit was not yet merged: github.com/mstamy2/PyPDF2/pull/237
-
crld over 7 years
This worked immediately for me, and it's extremely fast!! All my images came out inverted, but I was able to fix that with OpenCV. I've been using ImageMagick's convert, using subprocess to call it, but it is painfully slow. Thanks for sharing this solution.
-
user1717828 almost 7 years
I would love if someone found a Python module that doesn't rely on pdfimages being installed on the subsystem.
-
matt wilkie almost 7 years
That looks interesting. Where did you find it? (And, formatting in your post is a bit messed up. Unbalanced quotes I think.)
-
Max A. H. Hartvigsen almost 7 years
nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html you can find the original post here...
-
vishvAs vAsuki over 6 years
A related question here..
-
Petri over 6 years
Finds the images for me, but they are cropped/sized wrong, all b&w and have horizontal lines :(
-
GrantD71 over 6 years
This code fails for me on '/ICCBased' '/FlateDecode' filtered images with img = Image.frombytes(mode, size, data) ValueError: not enough image data
-
Labo over 6 years
@GrantD71 I am not an expert, and never heard of ICCBased before. Plus your error is not reproducible if you don't provide the inputs.
-
matt wilkie over 6 years
Thanks Colton. Homebrew is MacOS only. It's good practice to note OS when instructions are platform specific.
-
user1847 over 6 years
@mattwilkie -- Thanks for the heads up. Will note this in my answer.
-
Alok Nayak about 6 years
it doesn't output images pagewise
-
Basj almost 6 years
I get a KeyError: '/ColorSpace', so I would replace your line with DeviceRGB by if '/ColorSpace' not in xObject[obj] or xObject[obj]['/ColorSpace'] == '/DeviceRGB':. Anyway, this didn't work for me in the end because the images were probably PNG (not sure).
-
Basj almost 6 years
This works great! (pip install pymudf needed first obviously)
-
Labo almost 6 years
@Basj my code is supposed to work with PNG too. What is the value of xObject[obj]['/Filter']?
-
Basj almost 6 years
It is /CCITTFaxDecode. Then this code works. Erratum: I now see my files are a lot of .tiff files but not PNG
-
Labo almost 6 years
Perfect! It seems I had the same problem as the version I use is updated: dropbox.com/s/0w4wlifdu82mmaa/PDF_extract_images.py?dl=0
-
Gerald almost 6 years
I adapted your code to work on both Python 2 and 3. I also implemented the /Indexed change from Ronan Paixão. I also changed the filter if/elif to be 'in' rather than equals. I had a PDF with the /Filter type ['/ASCII85Decode', '/FlateDecode']. I also changed the function to return image blobs rather than write to file. The updated code can be found here: gist.github.com/gstorer/f6a9f1dfe41e8e64dcf58d07afa9ab2a
-
Labo almost 6 years
@Gerald awesome, thanks! I'll look at the code and update my dropbox :)
-
VSZM over 5 years
* pip install pymupdf for the fellow googlers who are wondering why the above install fails
-
Damotorie over 5 years
-
Darius Mandres over 5 years
@GrantD71 I have the same error on '/FlateDecode'. I can't make sense of it. Did you ever end up figuring it out? I created a test .pdf with 2 images inside. One .png, one .jpg. The .jpg one extracts just fine but the .pdf one gives this error.
-
Darius Mandres over 5 years
@Petri Had the same issue. Just use img = Image.frombytes('RGB', size, data). It works for .png/.jpg/.tiff files so far for me, although you may run into some problems; I haven't fully tested all use cases.
-
Dispenser about 5 years
As pointed out elsewhere your tiff_header_struct should read '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * 8 + 'L'. Note in particular the 'L' at the end.
-
Evan Mata about 5 years
This package is quite helpful (and well documented) and deserves upvotes.
-
crash about 5 years
Hi, to solve the NotImplementedError: unsupported filter /CCITTFaxDecode problem the library must be manually installed from the master branch of the github page? Installing it with pip install PyPDF2 won't work?
-
sylvain about 5 years
Hi, it seems that the most maintained library nowadays is PyPDF4: github.com/claird/PyPDF4
-
Abhimanyu about 5 years
DCTDecode CCITTFaxDecode filters still not implemented.
-
hru_d about 5 years
I followed @vishvAsvAsuki link but this package gives images with a white border, so I removed it following this stackoverflow question
-
Aakash Basu almost 5 years
Any help on this please: stackoverflow.com/questions/55899363/…
-
Sha Li almost 5 years
Hi there, minecart works perfectly but I got a small problem: sometimes the layout of the images is changed (horizontal -> vertical). Do you have any idea how I could avoid this? Thanks!
-
vault over 4 years
With this code I get RuntimeError: pixmap must be grayscale or rgb to write as png, can anyone help?
-
mxl over 4 years
Hello @Modem Rakesh goud, could you please provide the PDF file that triggered this error? Thank you!
-
Modem Rakesh goud over 4 years
Unfortunately, I can not share that pdf.
-
mxl over 4 years
Or would you eventually be in the possession of a program like Acrobat (not Reader, but the PRO version), or alternatively another PDF editing program which can extract a portion of the PDF and provide only that portion, or, just give me the traceback.print_exc() of the given error line, so that I can see what triggered it; or maybe opt for another of the solutions here on this site, as the one given here (to my understanding) is focused on providing a 1:1 lossless extraction of data from a PDF and may not be what you are looking for, thanks!
-
Oringa about 4 years
@vault This comment is outdated. You should change "if pix.n < 5" to "if pix.n - pix.alpha < 4" as the original condition does not correctly find CMYK images.
-
matt wilkie about 4 years
This is useful information and it should be documented and shared, as you have just done. +1. However I suggest posting as your own new question and then self-answer because it doesn't address doing this in python, which is point of this Q. (Feel free to cross-link the posts as this is related.)
-
Marco about 4 years
Hi @mattwilkie, thanks for the advice, here is the question: stackoverflow.com/questions/60851124/…
-
matt about 4 years
This worked perfectly for the PDF I wanted to extract images from. (In case it helps anyone else, I saved his code as a .py file, then installed/used Python 2.7.18 to run it, passing the path to my PDF as the single command-line argument.)
-
Peter about 4 years
Best option IMO: After installing fitz on Win 10, I got the error: ModuleNotFoundError: No module named 'frontend', which was easily solved by installing pip install PyMuPDF as discussed here: stackoverflow.com/questions/56467667/…
-
Basj almost 4 years
After a few tests on many PDFs, neither @Sylvain's version, this version, nor Gerald's gist version works reliably, sadly. Still, big up for the effort!
-
Karol Zlot almost 4 years
Better version of this solution (with working /CCITTFaxDecode) can be found in PyPDF4 repository: github.com/claird/PyPDF4/blob/master/scripts/…
-
xax almost 4 years
This code worked for me, with almost no modifications. Thank you.
-
MJeremy over 3 years
hi, I still don't get how the getData() fixed?
-
havlock over 3 years
This snippet may fail to find what look like images but aren't. The package author has a helpful response to this at github.com/pymupdf/PyMuPDF/issues/469
-
Javi12 over 3 years
With minecart I get: pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /CCITTFaxDecode
-
Hobbes about 3 years
I tested this and it does exactly what I needed, thanks! One point, filter = raw_image.stream_dict.Filter gives an error because filter is a function. When I change the name, I still get an error, NotImplementedError: don't know how to __str__ this object. I haven't been able to figure out what datatype .filter has.
-
andrewdotn about 3 years
Thanks for the comment. I’ve renamed filter to f to avoid the collision with Python’s built-in filter() function. raw_image.stream_dict.Filter is an instance of pikepdf.objects.Object for me; it seems to have a to_json() method you could try if str() isn’t doing what you want. But the PDF spec also indicates Filter may also be a list which might be part of what you’re seeing? That would be specific to the PDF you’re trying it on. You could try print(type(f)) and print(dir(f)) to see f’s type, attributes, and methods.
-
Matthias Fripp almost 3 years
This looks like it is now the easiest and most effective answer. I wish I'd seen it before I tried to implement this using PyPDF! One thing to mention: pikepdf crashed when I tried to export JBIG2 data, so then I installed jbig2dec (conda install jbig2dec) and it worked well. The code above saves image data directly if possible (DCTDecode > jpg, JPXDecode > jp2, CCITTFaxDecode > tif), and otherwise saves in a lossless PNG (JBIG2Decode, FlateDecode). I don't think you can do much better than that.
-
Matthias Fripp almost 3 years
This doesn't work with either PyPDF2 or PyPDF4. (1) It doesn't handle the /JBIG2Decode filter. You can partly fix that by saving the raw image data (data = xObject[obj]._data instead of data = xObject[obj].getData()). But jbig2 files are not widely supported, so this is not very useful. (2) The /CCITTFaxDecode filter also crashes in some cases (seems to happen because some PDFs store DecodeParams as an array of dictionaries but PyPDF expects a single dictionary). The PikePDF solution works much better.
-
Matthias Fripp almost 3 years
If you want a more "Pythonic" approach, you can also use the PikePDF solution in another answer. If you install jbig2dec (can be done with conda), that will also convert jbig2 images to png automatically.
-
user3072843 almost 3 years
This will convert the PDF into images, but it does not extract the images from the remaining text.
-
Steve Gon over 2 years
FYI this package is more than 5 years old and has not been updated since 2016.
-
Azhar Uddin Sheikh over 2 years
display is not defined
-
rmutalik over 2 years
@matt wilkie the problem is not with sylvain's answer. If you trace back the code, you will see that the author of PyPDF2 did not implement those 2 filters, as seen in this link on lines 348 and 353: github.com/mstamy2/PyPDF2/blob/master/PyPDF2/filters.py
-
Rufat over 2 years
For Windows, I compiled the jbig2dec file using Visual Studio and placed it in the Windows directory. The source code is here: jbig2dec.com. In the bat file:
call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars32.bat"
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.30.30704\bin\Hostx86\x86\nmake.exe" msvc.mak
-
swestrup over 2 years
pdfimages often fails for images that are composed of layers, outputting individual layers rather than the image-as-viewed.
-
swestrup over 2 years
I tried this on a 56-page document full of images, and it only found ONE image on page 53. No idea what the issue is.
-
swestrup over 2 years
I get AttributeError: module 'pdfminer.pdfparser' has no attribute 'PDFDocument'
-
Shreyansh Dwivedi about 2 years
The PyPDF2 library does not work any more in Python versions above 3.6; using it results in dependency errors.
-
sol about 2 years
i had to also pip install fitz
-
settwi about 2 years
maybe this is obvious, but you can also import sys and use sys.argv[1] instead of hard-coding a file name if you want to have a drag-and-drop script solution :)