Why are the images produced by pdfimages different when using the -all flag?

command-line pdf image-processing imagemagick


pdfimages -all returns the exact file that was stored in the pdf.

We can test this by doing a round-trip: starting with a jpg image, we add it to a pdf using LaTeX, extract it using pdfimages -all, and then compare it to the original. (The reason for using LaTeX will be explained later.)

I extracted the first jpg image from the PDF at your link and named it device.jpg. Let's put it in a PDF file using LaTeX:

$ cat img.tex 
\documentclass{article}
\usepackage{graphicx}
\begin{document}
\includegraphics[width=5in,keepaspectratio]{device}
\end{document}
$ pdflatex img
[...snip...]
Output written on img.pdf (1 page, 672455 bytes).
Transcript written on img.log.

Now, let's extract it using pdfimages -all and compare it with the original:

$ pdfimages -all img.pdf img-all
$ cmp device.jpg img-all-000.jpg 
$

The extracted jpg is byte-for-byte identical to the original.
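
The same check can be scripted. What follows is only a minimal sketch, assuming a POSIX shell with cmp available; same_bytes is a hypothetical helper name, not something provided by pdfimages or poppler.

```shell
# same_bytes FILE1 FILE2
# Hypothetical helper: prints "identical" when the two files match
# byte for byte, and "different" otherwise (including when either
# file is missing or unreadable).
same_bytes() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For the round trip above it would be called as `same_bytes device.jpg img-all-000.jpg`. Given the silent cmp result shown above, that call would print identical.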

Footnote: the reason for using LaTeX

The above test cannot be done using just any PDF creator. This is because not all PDF creators will put images into a PDF unmolested. For example, let's try ImageMagick's convert:

$ convert device.jpg device.pdf
$ pdfimages -all device.pdf device-all
$ cmp device.jpg device-all-000.jpg 
device.jpg device-all-000.jpg differ: byte 4, line 1

convert re-encoded the image before placing it in the pdf, so the extracted file is smaller than the original:

$ ls -1s device.jpg device-all-000.jpg 
528 device-all-000.jpg
656 device.jpg
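
File sizes alone do not show whether the pixel dimensions changed or only the compression did. One way to look at both is sketched below. It assumes ImageMagick's identify command is on the PATH (newer installs may expose it as `magick identify` instead), and describe_image is a hypothetical helper name.

```shell
# describe_image FILE...
# Hypothetical helper: for each file, prints "<bytes> <width>x<height> <file>".
# The dimensions come from ImageMagick's identify; when identify is missing
# or cannot read the file, "unknown" is printed in their place.
describe_image() {
    for f in "$@"; do
        bytes=$(wc -c < "$f") || return 1
        dims=$(identify -format '%wx%h' "$f" 2>/dev/null) || dims="unknown"
        # $((bytes)) strips the padding some wc implementations add
        printf '%s %s %s\n' "$((bytes))" "$dims" "$f"
    done
}
```

Running `describe_image device.jpg device-all-000.jpg` would then list the byte count and pixel dimensions of both files side by side.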

Image accuracy was part of pdflatex's design goals. Other PDF creation software may, by default, "optimize" images before placing them in the PDF.

Update: ShreevatsaR points out that the img2pdf utility also provides a lossless method to convert images to PDF. Non-TeX users will also likely find it much simpler to use.
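
The same round-trip test could be repeated with img2pdf in place of LaTeX. The sketch below assumes img2pdf and pdfimages are both installed and that the input is a JPEG (so the extracted file is named with a .jpg extension); roundtrip_img2pdf is a hypothetical helper name, and I have not run it against the image from the question.

```shell
# roundtrip_img2pdf FILE.jpg
# Hypothetical helper: embeds a JPEG in a temporary PDF with img2pdf,
# extracts it again with pdfimages -all, and returns 0 only if the
# extracted file is byte-for-byte identical to the input.
roundtrip_img2pdf() {
    src=$1
    tmp=$(mktemp -d) || return 2
    img2pdf "$src" -o "$tmp/out.pdf" \
        && pdfimages -all "$tmp/out.pdf" "$tmp/extracted" \
        && cmp -s "$src" "$tmp/extracted-000.jpg"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

It is meant to be used in a condition, for example `if roundtrip_img2pdf device.jpg; then echo identical; fi`, so that a non-zero status is handled rather than aborting a script running under `set -e`.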


Orion751

Updated on September 18, 2022

Comments

Orion751 over 1 year

It's my understanding that pdfimages -all extracts images from PDFs in their native formats.

Therefore, I expected that the JPG (lossy) images extracted from that command would have the same pixel information as the .ppm and .pbm files produced without the -all option, as well as the PNG (lossless) files created when I right-click and save the image in Evince.

However, my use of the ImageMagick compare command tells me that there are differences in the images contained within the JPG files compared to the other options above. To reproduce, download the PDF in this link (https://fccid.io/document.php?id=2149405), use it as an argument for pdfimages and pdfimages -all and use the first .ppm file and the first .jpg file as arguments for compare. When I do this, it produces an image file containing red to indicate a difference in the images.

Is there something that I don't understand? Is pdfimages adding pixel information by default when it creates .ppm and .pbm files?
- John1024 almost 8 years
  
  Just how much difference is there between these images? Can you supply examples?
- Orion751 almost 8 years
  
  @John1024 I'm trying to get images displaying the problem, but the PNGs seem to be too large for Stack Overflow/Imgur.
- Orion751 almost 8 years
  
  @John1024 Would a link to a PDF source be of any help? To reproduce, download it, use it as an argument for pdfimages and pdfimages -all and use the first .ppm file and the first .jpg file as arguments for compare. When I do this, it produces an image file containing red to indicate a difference in the images - fccid.io/document.php?id=2149405
- John1024 almost 8 years
  
  I tried that and saw some red. I also tried using convert to convert a jpg to ppm (no pdf involvement) and then running compare on the two; I still got differences. It might be that the conversion process introduces some rounding errors, which compare then detects.
- Orion751 almost 8 years
  
  @John1024 Are you suggesting that these rounding-error issues may also apply to pdfimages -all? Prior to using a newer version of pdfimages for the -all flag, I was able to use convert on the example to convert it from .ppm to .png without getting red, but I also got red when I tried converting to .jpg with it.
- John1024 almost 8 years
  
  I was expecting that rounding errors occurred upon conversion to or from a lossy format (jpg). [As another possibility, which doesn't apply here, I would expect rounding errors to occur with conversion between true-color (24bit) and reduced-color (8bit).] So, I would expect pdfimages -all would give you the 'correct' image and conversion of a jpg (whether from a pdf or otherwise) to ppm would add (small) rounding errors. Having been frustrated with the imprecision of compare's documentation, I was hoping to do my own direct comparison but I haven't had enough free time yet.
- Orion751 almost 8 years
  
  @John1024 So you're saying that you think pdfimages -all is producing the true image as it is stored in the .pdf, while pdfimages (without -all) is altering the image when it converts to .ppm, but you are unsure?
- John1024 almost 8 years
  
  I just did a test (see answer) and pdfimages -all produces the exact byte-for-byte true image file. So, differences between the ppm and the jpg would appear to be errors produced when the conversion to ppm occurred.
- ShreevatsaR over 4 years
  
  To insert an image losslessly into a PDF, apart from LaTeX, one can also use img2pdf (GitLab, GitHub).
- John1024 over 4 years
  
  @ShreevatsaR Interesting. Thanks! Answer updated with a link to img2pdf.