Use Ghostscript, but tell it to not reprocess images?

If you just want to concatenate two PDF files without any reprocessing of their content, pdftk is for you. (On Mac OS X it should be available via MacPorts or Fink; for Linux, there are native packages for all major distributions; a Windows version is available as well.) Try this:

 pdftk title.pdf content.pdf cat output book.pdf

This will prepend the title.pdf to the content.pdf and write the result into book.pdf.

pdftk is a "dumb" but very fast way to concatenate two (or more) PDF files. "Dumb" insofar as pdftk does not interpret the PDF data stream in any way; it just makes sure that the internal object numbers are reshuffled as needed and appear in the PDF xref structure (which is basically a sort of table of contents for the PDF's objects).
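
The same pattern extends to more than two inputs. The following is a minimal sketch, not a definitive script: the function name merge_with_pdftk and the DRY_RUN switch are hypothetical, invented for this example; only the "inputs, then cat output, then target" shape comes from the command above.

```shell
#!/bin/sh
# Sketch: concatenate any number of PDFs with pdftk, in the order given.
# merge_with_pdftk and DRY_RUN are hypothetical names used for illustration.
merge_with_pdftk() {
    if [ "$#" -lt 2 ]; then
        echo "usage: merge_with_pdftk OUTPUT.pdf INPUT.pdf..." >&2
        return 2
    fi
    out=$1; shift
    # Same shape as the command above: inputs, "cat output", then the target.
    set -- pdftk "$@" cat output "$out"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"   # only print the command that would run
    else
        "$@"                 # actually run pdftk
    fi
}
```

Calling it as `DRY_RUN=1 merge_with_pdftk book.pdf title.pdf content.pdf` prints the pdftk command instead of running it, which is handy for checking the input order first.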

Ghostscript:

If you want to use Ghostscript, the basic command to concatenate the same two files would be:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
   title.pdf \
   content.pdf

However, as you experienced, this simple command line may mess up your image quality. The reason is that Ghostscript is not "dumb" when it processes PDFs: it completely interprets them when reading them in, and creates a completely new file when writing out the result. In creating that result, it automatically uses default settings for many details of the overall processing. These defaults apply wherever the invocation has not instructed Ghostscript otherwise.

So Ghostscript's method of creating the new book.pdf is much more "intelligent" (but also much slower) than pdftk's method. (This is also why Ghostscript is in many cases able, within limits, to "repair" broken PDF files, to embed fonts into the output PDF that were not embedded in the input PDFs, or to remove duplicate images and replace them with mere references, and overall to create smaller, better optimized files from bloated input PDFs.)

The solution is to keep Ghostscript from using its defaults by adding more custom parameters to the command line.

What does it mean "Ghostscript 'interprets' its PDF input"?

The whole file and its contents (objects, streams, fonts, images, ...) are read in, checked and held in Ghostscript's own internal representation before it writes out the resulting PDF with its PDF objects again. However, when writing out, Ghostscript applies its internal default settings to the hundreds of parameters [*] that are available.

Unfortunately, this is what causes the "reprocessing" of your images according to these default settings, which can only be avoided or overridden by adding your own (desired) command-line parameters.

Your image problems could be caused by Ghostscript's need (due to licensing issues) to re-encode JPEG2000 images to JPEG encoding. If you want to avoid this, add the following to your commandline:

-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
-dColorImageFilter=/FlateEncode \
-dGrayImageFilter=/FlateEncode \

Note that /FlateEncode above means that any JPEG stream contained in your input PDF file will be decoded and stored again as losslessly Flate-compressed (zlib) data instead of JPEG. This can massively increase the size of your generated PDF file.
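
Since the size can change this much, it may be worth comparing the output against its inputs after each attempt. A minimal sketch using only wc and awk; the helper names file_size and size_report are hypothetical, made up for this example.

```shell
#!/bin/sh
# Sketch: compare the size of a merged PDF with the combined size of its inputs.
# file_size and size_report are hypothetical helper names.
file_size() {
    # wc -c prints the byte count; tr strips the padding some platforms add.
    wc -c < "$1" | tr -d ' '
}

size_report() {
    out=$1; shift
    total=0
    for f in "$@"; do
        total=$((total + $(file_size "$f")))
    done
    out_size=$(file_size "$out")
    awk -v a="$total" -v b="$out_size" 'BEGIN {
        if (a == 0) { print "inputs are empty"; exit 1 }
        printf "inputs: %s bytes, output: %s bytes (x%.2f)\n", a, b, b / a
    }'
}
```

For example, `size_report book.pdf title.pdf content.pdf` prints the combined input size, the output size, and the ratio between them.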

Other image-related command-line options worth considering are:

-dColorConversionStrategy=/LeaveColorUnchanged \
-dDownsampleMonoImages=false \
-dDownsampleGrayImages=false \
-dDownsampleColorImages=false \

So the complete Ghostscript commandline that could make you happy should read:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -dDownsampleMonoImages=false \
  -dDownsampleGrayImages=false \
  -dDownsampleColorImages=false \
  -dAutoFilterColorImages=false \
  -dAutoFilterGrayImages=false \
  -dColorImageFilter=/FlateEncode \
  -dGrayImageFilter=/FlateEncode \
   title.pdf \
   content.pdf
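
If you run this often, the flags can be kept in one place. The following is only a sketch under stated assumptions: the function name gs_merge_keep_images and the DRY_RUN switch are hypothetical; the Ghostscript options are exactly the ones from the command line above.

```shell
#!/bin/sh
# Sketch: wrap the complete Ghostscript command line from above in a function.
# gs_merge_keep_images and DRY_RUN are hypothetical names for this example.
gs_merge_keep_images() {
    if [ "$#" -lt 2 ]; then
        echo "usage: gs_merge_keep_images OUTPUT.pdf INPUT.pdf..." >&2
        return 2
    fi
    out=$1; shift
    # Prepend the fixed options; the remaining arguments are the input PDFs.
    set -- gs -o "$out" -sDEVICE=pdfwrite \
        -dColorConversionStrategy=/LeaveColorUnchanged \
        -dDownsampleMonoImages=false \
        -dDownsampleGrayImages=false \
        -dDownsampleColorImages=false \
        -dAutoFilterColorImages=false \
        -dAutoFilterGrayImages=false \
        -dColorImageFilter=/FlateEncode \
        -dGrayImageFilter=/FlateEncode \
        "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"   # only print the command that would run
    else
        "$@"                 # actually run Ghostscript
    fi
}
```

`gs_merge_keep_images book.pdf title.pdf content.pdf` then runs the same command as above; prefix it with `DRY_RUN=1` to see the assembled command line without invoking gs.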

You could also tell Ghostscript NOT to compress images at all in the output PDF, by using this commandline:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -dEncodeColorImages=false \
  -dEncodeGrayImages=false \
  -dEncodeMonoImages=false \
   title.pdf \
   content.pdf

[*]:
If you want to see the complete list of default settings that Ghostscript's pdfwrite device uses, run the following command:

 gs \
   -sDEVICE=pdfwrite \
   -o /dev/null \
   -c "currentpagedevice { exch ==only ( ) print == } forall"

For explanations of what exactly all these parameters mean, you'll have to read the Adobe documentation on "Distiller Parameters". Ghostscript tries very hard to mimic all of them...
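
The dump from the command above is long, so a small filter can narrow it down to the image-related keys discussed in this answer. This is a sketch only: image_params is a hypothetical helper name, and it assumes each setting is printed on its own line as "/Key value", which is what the PostScript snippet above produces.

```shell
#!/bin/sh
# Sketch: keep only image-related settings from the parameter dump.
# image_params is a hypothetical helper; it reads "/Key value" lines on stdin.
image_params() {
    grep -E '^/[A-Za-z]*(Image|ColorConversion)' | sort
}

# Usage (requires Ghostscript; -q suppresses the startup banner):
#   gs -q -sDEVICE=pdfwrite -o /dev/null \
#      -c "currentpagedevice { exch ==only ( ) print == } forall" | image_params
```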


Author: Mahmoud Al-Qudsi

Updated on September 18, 2022

Comments

  • Mahmoud Al-Qudsi
    Mahmoud Al-Qudsi over 1 year

    I have a PDF that has already compressed and somewhat artifact-y images, and I'm using Ghostscript to prepend a title page to that PDF.

    However, I cannot find any way to tell GS to just use the existing images as-is without reprocessing them, and now I'm feeling as if it's something to do with how GS works, i.e. you can't recompile/link a PDF without reprocessing its images.. Is that true?

    I can raise the DPI setting in GS, but it'll go from 5MB to 60MB while still looking worse.

    Is there any better alternative to GS that'll do what I need (preferably that will compile on OS X)?

    • Kurt Pfeifle
      Kurt Pfeifle over 12 years
      Can you edit your question and quote the exact commandline you are using to prepend your title page to the original PDF? Then I could tell you what exactly to change or add to the commandline in order to get a better output for images...
    • Mahmoud Al-Qudsi
      Mahmoud Al-Qudsi over 12 years
      I don't want to just have it look better, I want to merge without reprocessing. This will a) result in better quality (lossless transforms), and b) not waste hours of CPU time processing my 1000+ page document.
    • Kurt Pfeifle
      Kurt Pfeifle over 12 years
      Hey, you didn't answer my question and you didn't quote the exact GS commandline you are using. Which means: you'll not be getting the help regarding GS you're looking for...
  • Dor
    Dor about 8 years
    (FYI) In my case, the flags dEncodeColorImages, dEncodeGrayImages, dEncodeMonoImages cause the output file to become a lot more massive. By removing them, the file size changed from 22MB to 3.1MB and the image quality seems exactly the same as with these flags. All the flags that I use are: dColorConversionStrategy=/LeaveColorUnchanged, dDownsampleMonoImages=false, dDownsampleGrayImages=false, dDownsampleColorImages=false, dAutoFilterColorImages=false, dAutoFilterGrayImages=false, dColorImageFilter=/FlateEncode, dGrayImageFilter=/FlateEncode
  • Louis Somers
    Louis Somers over 4 years
    @Kurt Pfeifle What options are allowed for -dColorImageFilter? I can only find FlateEncode and DCTEncode. DCT seems to do JPEG (why did they encrypt that?). I think FLATE is an outdated option for images by now since Bell Labs patent on LZW is no longer an issue? However after spending quite some time searching I cannot find how to use PNG (or anything else)... My original images are PNG and I want them to stay unchanged. I tried the -c option, but it gives me -c can only be used in a built with POSTSCRIPT included....
  • Mahmoud Al-Qudsi
    Mahmoud Al-Qudsi over 2 years
    I continue to run into problems using ghostscript to do this. Concatenating two PDFs with the gs approach outlined above gives me a 6.35MiB file, while using pdftk gives me the correct/expected 1.48MiB result.