How to convert a color pdf to black-white?

46,526

Solution 1

The gs example

The gs command you're running above has a trailing $1 which is typically meant for passing command line arguments into a script. So I'm not sure what you actually tried but I'm guessing that you tried to put that command into a script, script.sh:

#!/bin/bash

gs      -sOutputFile=output.pdf \
        -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite \
        -dCompatibilityLevel=1.3 \
        -dPDFSETTINGS=/screen \
        -dEmbedAllFonts=true \
        -dSubsetFonts=true \
        -sColorConversionStrategy=/Mono \
        -sColorConversionStrategyForImages=/Mono \
        -sProcessColorModel=/DeviceGray \
        $1

And run it like this:

$ ./script.sh: 19: ./script.sh: output.pdf: not found

Not sure how you setup this script but it needs to executable.

$ chmod +x script.sh

Something definitely doesn't seem right with that script though. When I tried it I got this error instead:

Unrecoverable error: rangecheck in .putdeviceprops

An alternative

Instead of that script I'd use this one from the SU question instead.

#!/bin/bash

gs \
 -sOutputFile=output.pdf \
 -sDEVICE=pdfwrite \
 -sColorConversionStrategy=Gray \
 -dProcessColorModel=/DeviceGray \
 -dCompatibilityLevel=1.4 \
 -dNOPAUSE \
 -dBATCH \
 $1

Then run it like this:

$ ./script.bash LeaseContract.pdf 
GPL Ghostscript 8.71 (2010-02-10)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 2.
Page 1
Page 2

Solution 2

I found a script here that can do this. It requires gs which you seem to have but also pdftk. You have not mentioned your distribution but on Debian-based systems, you should be able to install it with

sudo apt-get install pdftk

You can find RPMs for it here.

Once you have installed pdftk, save the script as graypdf.sh and run like so:

./greypdf.sh input.pdf

It will create a file called input-gray.pdf. I am including the whole script here to avoid link rot:

# convert pdf to grayscale, preserving metadata
# "AFAIK graphicx has no feature for manipulating colorspaces. " http://groups.google.com/group/latexusersgroup/browse_thread/thread/5ebbc3ff9978af05
# "> Is there an easy (or just standard) way with pdflatex to do a > conversion from color to grayscale when a PDF file is generated? No." ... "If you want to convert a multipage document then you better have pdftops from the xpdf suite installed because Ghostscript's pdf to ps doesn't produce nice Postscript." http://osdir.com/ml/tex.pdftex/2008-05/msg00006.html
# "Converting a color EPS to grayscale" - http://en.wikibooks.org/wiki/LaTeX/Importing_Graphics
# "\usepackage[monochrome]{color} .. I don't know of a neat automatic conversion to monochrome (there might be such a thing) although there was something in Tugboat a while back about mapping colors on the fly. I would probably make monochrome versions of the pictures, and name them consistently. Then conditionally load each one" http://newsgroups.derkeiler.com/Archive/Comp/comp.text.tex/2005-08/msg01864.html
# "Here comes optional.sty. By adding \usepackage{optional} ... \opt{color}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds_color}} \opt{grayscale}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds}} " - http://chem-bla-ics.blogspot.com/2008/01/my-phd-thesis-in-color-and-grayscale.html
# with gs:
# http://handyfloss.net/2008.09/making-a-pdf-grayscale-with-ghostscript/
# note - this strips metadata! so:
# http://etutorials.org/Linux+systems/pdf+hacks/Chapter+5.+Manipulating+PDF+Files/Hack+64+Get+and+Set+PDF+Metadata/
COLORFILENAME=$1
OVERWRITE=$2
FNAME=${COLORFILENAME%.pdf}
# NOTE: pdftk does not work with logical page numbers / pagination;
# gs kills it as well;
# so check for existence of 'pdfmarks' file in calling dir;
# if there, use it to correct gs logical pagination
# for example, see
# http://askubuntu.com/questions/32048/renumber-pages-of-a-pdf/65894#65894
PDFMARKS=
if [ -e pdfmarks ] ; then
PDFMARKS="pdfmarks"
echo "$PDFMARKS exists, using..."
# convert to gray pdf - this strips metadata!
gs -sOutputFile=$FNAME-gs-gray.pdf -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME" "$PDFMARKS"
else # not really needed ?!
gs -sOutputFile=$FNAME-gs-gray.pdf -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME"
fi
# dump metadata from original color pdf
## pdftk $COLORFILENAME dump_data output $FNAME.data.txt
# also: pdfinfo -meta $COLORFILENAME
# grep to avoid BookmarkTitle/Level/PageNumber:
pdftk $COLORFILENAME dump_data output | grep 'Info\|Pdf' > $FNAME.data.txt
# "pdftk can take a plain-text file of these same key/value pairs and update a PDF's Info dictionary to match. Currently, it does not update the PDF's XMP stream."
pdftk $FNAME-gs-gray.pdf update_info $FNAME.data.txt output $FNAME-gray.pdf
# (http://wiki.creativecommons.org/XMP_Implementations : Exempi ... allows reading/writing XMP metadata for various file formats, including PDF ... )
# clean up
rm $FNAME-gs-gray.pdf
rm $FNAME.data.txt
if [ "$OVERWRITE" == "y" ] ; then
echo "Overwriting $COLORFILENAME..."
mv $FNAME-gray.pdf $COLORFILENAME
fi
# BUT NOTE:
# Mixing TEX & PostScript : The GEX Model - http://www.tug.org/TUGboat/Articles/tb21-3/tb68kost.pdf
# VTEX is a (commercial) extended version of TEX, sold by MicroPress, Inc. Free versions of VTEX have recently been made available, that work under OS/2 and Linux. This paper describes GEX, a fast fully-integrated PostScript interpreter which functions as part of the VTEX code-generator. Unless specified otherwise, this article describes the functionality in the free- ware version of the VTEX compiler, as available on CTAN sites in systems/vtex.
# GEX is a graphics counterpart to TEX. .. Since GEX may exercise subtle influence on TEX (load fonts, or change TEX registers), GEX is op- tional in VTEX implementations: the default oper- ation of the program is with GEX off; it is enabled by a command-line switch.
# \includegraphics[width=1.3in, colorspace=grayscale 256]{macaw.jpg}
# http://mail.tug.org/texlive/Contents/live/texmf-dist/doc/generic/FAQ-en/html/FAQ-TeXsystems.html
# A free version of the commercial VTeX extended TeX system is available for use under Linux, which among other things specialises in direct production of PDF from (La)TeX input. Sadly, it���s no longer supported, and the ready-built images are made for use with a rather ancient Linux kernel.
# NOTE: another way to capture metadata; if converting via ghostscript:
# http://compgroups.net/comp.text.pdf/How-to-specify-metadata-using-Ghostscript
# first:
# grep -a 'Keywo' orig.pdf
# /Author(xxx)/Title(ttt)/Subject()/Creator(LaTeX)/Producer(pdfTeX-1.40.12)/Keywords(kkkk)
# then - copy this data in a file prologue.ini:
#/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
#[/Author(xxx)
#/Title(ttt)
#/Subject()
#/Creator(LaTeX with hyperref package + gs w/ prologue)
#/Producer(pdfTeX-1.40.12)
#/Keywords(kkkk)
#/DOCINFO pdfmark
#
# finally, call gs on the orig file,
# asking to process pdfmarks in prologue.ini:
# gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
# -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -dDOPDFMARKS \
# -sOutputFile=out.pdf in.pdf prologue.ini
# then the metadata will be in output too (which is stripped otherwise;
# note bookmarks are preserved, however). 

Solution 3

I also had some scanned color pdfs and grayscale pdfs that I wanted to convert to bw. I tried using gs with the code listed here, and image quality is good with pdf text still there. However, that gs code only converts to grayscale (as asked in the question) and still has large file size. convert yields very poor results when used directly.

I wanted bw pdfs with good image quality and small file size. I would have tried terdon's solution, but I could not get pdftk on centOS 7 using yum (at time of writing).

My solution uses gs to extract grayscale bmp files from the pdf, convert to threshold those bmps to bw and save them as tiff files, and then img2pdf to compress the tiff images and merge them all into one pdf.

I tried going directly to tiff from the pdf but the quality is not the same so I save each page to bmp. For a one page pdf file, convert does a great job from bmp to pdf. Example:

gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -r300x300 \
   -sOutputFile=./pdf_image.bmp ./input.pdf

convert ./pdf_image.bmp -threshold 40% -compress zip ./bw_out.pdf

For multiple pages, gs can merge multiple pdf files into one, but img2pdf yields smaller file size than gs. The tiff files must be uncompressed as input to img2pdf. Keep in mind for large numbers of pages, the intermediate bmp and tiff files tend to be large in size. pdftk or joinpdf would be better if they can merge compressed pdf files from convert.

I imagine there is a more elegant solution. However, my method produces results with very good image quality and much smaller file size. To get text back in the bw pdf, run OCR again.

My shell script uses gs, convert, and img2pdf. Change the parameters (# of pages, scan dpi, threshold %, etc) listed in the beginning as needed, and run chmod +x ./pdf2bw.sh . Here is the full script (pdf2bw.sh):

#!/bin/bash

num_pages=12
dpi_res=300
input_pdf_name=color_or_grayscale.pdf
bw_threshold=40%
output_pdf_name=out_bw.pdf
#-------------------------------------------------------------------------
gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -q -r$dpi_res \
   -sOutputFile=./%d.bmp ./$input_pdf_name
#-------------------------------------------------------------------------
for file_num in `seq 1 $num_pages`
do
  convert ./$file_num.bmp -threshold $bw_threshold \
          ./$file_num.tif
done
#-------------------------------------------------------------------------
input_files=""

for file_num in `seq 1 $num_pages`
do
  input_files+="./$file_num.tif "
done

img2pdf -o ./$output_pdf_name --dpi $dpi_res $input_files
#-------------------------------------------------------------------------
# clean up bmp and tif files used in conversion

for file_num in `seq 1 $num_pages`
do
  rm ./$file_num.bmp
  rm ./$file_num.tif
done

Solution 4

I get reliable results cleaning up scanned pdf's to good contrast with this script;

#!/bin/bash
# 
# $ sudo apt install poppler-utils img2pdf pdftk imagemagick
#
# Output is still greyscale, but lots of scanner light tone fuzz removed.
#

pdfimages $1 pages

ls ./pages*.ppm | xargs -L1 -I {} convert {}  -quality 100 -density 400 \
  -fill white -fuzz 80% -auto-level -depth 4 +opaque "#000000" {}.jpg

ls -1 ./pages*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf

pdftk pages*.pdf cat output ${1/.pdf/}_bw.pdf

rm pages*

Solution 5

RHEL6 and RHEL5, which both baseline Ghostscript on 8.70, couldn't use the forms of the command given above. Assuming a script or a function expecting the PDF file as the first argument "$1", the following should be more portable:

gs \
    -sOutputFile="grey_$1" \
    -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Mono \
    -sColorConversionStrategyForImages=/Mono \
    -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.3 \
    -dNOPAUSE -dBATCH \
    "$1"

Where the output file will be prefixed with "grey_".

RHEL6 and 5 can use CompatibilityLevel=1.4 which is much quicker, but I was aiming for portability.

Share:
46,526

Related videos on Youtube

BowPark
Author by

BowPark

Updated on September 18, 2022

Comments

  • BowPark
    BowPark over 1 year

    I'd like to transform a pdf with some coloured text and images in another pdf with only black&white, in order to reduce its dimensions. Moreover, I would like to keep the text as text, without transforming the pages elements in pictures. I tried the following command:

    convert -density 150 -threshold 50% input.pdf output.pdf
    

    found in another question, a link, but it does what I don't want: the text in the output is transformed in a poor image and is no longer selectable. I tried with Ghostscript:

    gs      -sOutputFile=output.pdf \
            -q -dNOPAUSE -dBATCH -dSAFER \
            -sDEVICE=pdfwrite \
            -dCompatibilityLevel=1.3 \
            -dPDFSETTINGS=/screen \
            -dEmbedAllFonts=true \
            -dSubsetFonts=true \
            -sColorConversionStrategy=/Mono \
            -sColorConversionStrategyForImages=/Mono \
            -sProcessColorModel=/DeviceGray \
            $1
    

    but it gives me the following error message:

    ./script.sh: 19: ./script.sh: output.pdf: not found
    

    Is there any other way to create the file?

  • Sora.
    Sora. over 6 years
    You're right, there's something wrong with the script: "something" in this case would be sProcessColorModel which should be dProcessColorModel instead.
  • Igor
    Igor about 5 years
    Devs say (1,2,3,4) that there's no sColorConversionStrategyForImages switch.
  • Rich
    Rich about 5 years
    Thanks, @Igor -- I have no idea where I got that snippet from! I know for a fact that I tested it and it worked at the time. (And that, folks, is why you should always provide references for your code.)
  • Igor
    Igor about 5 years
    That "fake parameter" seem to be an incredibly popular among the web. GS ignores unknown switches (which is sad), so it works anyway.
  • Admin
    Admin almost 2 years
    Automatic replacement suggestion for output file name : input.pdf => input_bw.pdf -sOutputFile=${1%%.*}_bw.pdf