How can I convert a scanned PDF with OCRed text to one without OCRed text?


Solution 1

Here is how I would remove the OCR-ed text should I have to...

First, you need to know that OCR-ed text in a PDF is not a layer, but text drawn in a special text rendering mode. The official PDF specification lists all available text rendering modes (0 to 7); mode 3 means the text is neither filled nor stroked, which makes it invisible.

For more background, please see these answers of mine on StackOverflow:

Now for the procedure I envisage:

0. Make a backup of your original PDF file

'nuff said...

1. Use qpdf to un-compress most of the PDF objects

qpdf is a beautiful command line tool to transform most PDFs into a form that makes it easier to manipulate through a text editor (or through sed):

qpdf                       \
  --qdf                    \
  --object-streams=disable \
    input.pdf              \
    editable.pdf

2. Search for spots where PDF code contains 3 Tr

All spots in editable.pdf where there is 'invisible' (i.e. neither filled nor stroked) text are marked by an initial definition of

3 Tr

Change these to now read

1 Tr

This should make the previously hidden text visible. Glyphs will appear in thick outlines, overlaying the original scanned page images.

It will look very ugly.

Save the edited PDF.
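If you prefer sed over a text editor for this step, the following is a minimal sketch. The function name show_hidden_text and the output file name are my own assumptions, not part of qpdf or any other tool. It replaces one digit with another, so the byte count stays the same. It will also hit any binary stream that happens to contain the bytes "3 Tr", so compare the result against your backup.

```shell
# Hypothetical helper: rewrite text rendering mode 3 (invisible) to 1 (stroke).
# Usage: show_hidden_text IN.pdf OUT.pdf
show_hidden_text() {
  in=$1
  out=$2
  # LC_ALL=C makes sed treat the file as plain bytes.
  # Only a standalone "3 Tr" token is matched, so "13 Tr" is left alone.
  LC_ALL=C sed -E 's/(^|[[:space:]])3 Tr([[:space:]]|$)/\11 Tr\2/g' "$in" > "$out"
}
```

A call could look like this: show_hidden_text editable.pdf visible.pdf (then continue with visible.pdf, or rename it back to editable.pdf).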

3. Change Tj and TJ text-showing operators to 'no-ops'

Whenever a text string is prepared for being rendered, the actual operator that is responsible for doing so is named Tj or TJ.

Look out for all of these. Replace them by tJ and tj. This will change them into 'no-ops': they have no meaning at all in the PDF source code; no PDF viewer or processor will "understand" them. (Be careful not to change the number of bytes when replacing stuff in PDF source code, because otherwise you may cause it to become "corrupted".)

Save the PDF file.
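The renaming can also be scripted. This is only a sketch under assumptions: the function name disable_text_ops is made up, and mapping Tj to tj and TJ to tJ is an arbitrary choice (any unknown two-letter name of the same length would do). It only touches operators that stand as separate tokens, which is what the qpdf output from step 1 normally gives you, and it keeps the byte count unchanged. A literal " Tj " inside a text string would also be renamed, so check the result.

```shell
# Hypothetical helper: rename the text-showing operators Tj and TJ
# to the meaningless tokens tj and tJ (same length, so offsets stay valid).
# Usage: disable_text_ops IN.pdf OUT.pdf
disable_text_ops() {
  in=$1
  out=$2
  # An operator is matched only after line start, whitespace, ")", "]" or ">"
  # and only before whitespace or line end, so names like /Tj1 are untouched.
  LC_ALL=C sed -E \
    -e 's/(^|[])>[:space:]])Tj([[:space:]]|$)/\1tj\2/g' \
    -e 's/(^|[])>[:space:]])TJ([[:space:]]|$)/\1tJ\2/g' \
    "$in" > "$out"
}
```

Example call: disable_text_ops editable.pdf notext.pdf (then use notext.pdf as the input for the remaining steps).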

4. Check how the PDF file looks now

The PDF should now look "clean" again. The renamed text operators do not have any meaning any more for the PDF viewer, nor for any PDF interpreter.

5. Use Ghostscript to create the final PDF

This command should achieve what you want:

gs                        \
  -o final.pdf            \
  -sDEVICE=pdfwrite       \
  -dPDFSETTINGS=/prepress \
   editable.pdf

This final step uses editable.pdf as input and writes final.pdf. All traces of text will be gone from the output. The input still had the text, albeit in an "unusable" form, because of the operator renaming. Since Ghostscript does not "understand" the renamed operators, it will simply skip them by default.
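To check that the text is really gone, you can feed the output of a text extractor into a small test. The sketch below assumes you use pdftotext from Poppler for the extraction; the helper name has_no_text is hypothetical. It succeeds only if its standard input contains nothing but whitespace (pdftotext still emits form feeds and newlines for empty pages).

```shell
# Hypothetical helper: exit status 0 if stdin holds only whitespace
# (spaces, newlines, form feeds), non-zero if any other byte shows up.
has_no_text() {
  ! LC_ALL=C grep -q '[^[:space:]]'
}
```

Example call: pdftotext final.pdf - | has_no_text && echo "no text left"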

Solution 2

There are multiple ways to get rid of the OCRed text in the file.

  1. Export the scanned images from the PDF and recombine them. You can use pdfimages for the extraction (from the poppler-utils package) and convert (from imagemagick) to convert them back:

    pdfimages toc.pdf toctmp
    convert toctmp*.pbm newtoc.pdf
  2. Print to PDF (with PDF support from cups-pdf)

PDF is a horrible format for scanned images, but it is used quite often because it can hold multiple pages in one file. The storage format, however, is often JPEG, which is inappropriate for scans. Recovering the original images from the PDF is probably not possible (there is no such thing as the original scanned PDF file), because creating the PDF from the scanned images is most often the quality-reducing step after scanning. You can try to get the images out of the PDF with pdfimages (or pdftoppm), but OCR software that works on images in PDFs already knows how to get the best (only) quality images out of them, so it is unlikely that you can do anything to improve on that.

The problem probably lies with your scanning software, not with the OCR software. If you still have the original material, scan it once more to multipage TIFF (LZW compressed); that gives much better OCR results than anything that was converted to a PDF containing JPEGs.

Solution 3

When I tried to access the link to your sample scanned file earlier, it didn't work for me. However, meanwhile I downloaded it, and had a closer look.

1. Using pdfimages -list to investigate the embedded images

If you run a recent (!) version of the Poppler variant of pdfimages, you'll have the -list parameter available. This parameter prints a useful list of images contained in your PDF file. The most recent versions also will tell you some additional info (like image resolution and compression ratio), which were not so easily available before.

Unfortunately, your PDF file contains some syntax errors, which give this garbled output:

kp@mbp:#175536> pdfimages -l 1 -list toc.pdf
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
    1   0 image  2000  2650  icc     1   1  jbig2  no       51 0   300   300 12.4K 1.9%

So let's redirect <stderr> output to /dev/null and try again:

kp@mbp:#175536> pdfimages -list toc.pdf 2>/dev/null
page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
   1   0 image  2000  2650  icc     1   1  jbig2  no       51 0   300   300 12.4K 1.9%
   2   1 image  2012  2659  icc     1   1  jbig2  no      616 0   300   301 16.1K 2.5%
   3   2 image  2014  2661  icc     1   1  jbig2  no      696 0   301   300 16.0K 2.4%
   4   3 image  2000  2650  icc     1   1  jbig2  no      778 0   300   300 16.2K 2.5%
   5   4 image  2000  2650  icc     1   1  jbig2  no      855 0   300   300 16.2K 2.5%
   6   5 image  2000  2650  icc     1   1  jbig2  no      938 0   300   300 15.7K 2.4%
   7   6 image  2000  2650  icc     1   1  jbig2  no     1026 0   300   300 15.5K 2.4%
   8   7 image  2022  2667  icc     1   1  jbig2  no     1103 0   300   300 15.7K 2.4%
   9   8 image  2000  2650  icc     1   1  jbig2  no     1190 0   300   300 15.5K 2.4%
  10   9 image  2011  2658  icc     1   1  jbig2  no     1271 0   300   301 15.7K 2.4%
  11  10 image  2000  2650  icc     1   1  jbig2  no     1347 0   300   300 15.7K 2.4%
  12  11 image  2010  2657  icc     1   1  jbig2  no     1429 0   300   300 15.5K 2.4%
  13  12 image  2000  2650  icc     1   1  jbig2  no     1504 0   300   300 16.8K 2.6%
  14  13 image  2000  2650  icc     1   1  jbig2  no     1589 0   300   300 15.4K 2.4%
  15  14 image  2000  2650  icc     1   1  jbig2  no     1666 0   300   300 17.6K 2.7%
  16  15 image  2010  2657  icc     1   1  jbig2  no     1740 0   300   300 18.7K 2.9%
  17  16 image  2006  2654  icc     1   1  jbig2  no     1823 0   300   301 17.7K 2.7%
  18  17 image  2007  2656  icc     1   1  jbig2  no     1905 0   300   300 16.9K 2.6%
  19  18 image  2000  2650  icc     1   1  jbig2  no     1983 0   300   300 16.7K 2.6%
  20  19 image  2000  2650  icc     1   1  jbig2  no     2065 0   300   300 17.4K 2.7%
  21  20 image  2000  2650  icc     1   1  jbig2  no     2148 0   300   300 17.4K 2.7%
  22  21 image  2011  2658  icc     1   1  jbig2  no     2229 0   300   301 17.2K 2.6%
  23  22 image  2006  2654  icc     1   1  jbig2  no     2305 0   300   301 17.5K 2.7%
  24  23 image  2000  2650  icc     1   1  jbig2  no     2377 0   300   300 14.5K 2.2%

This output means:

  • 24 images (numbered 0--23) on 24 pages (each page 1 image).
  • All images have very similar dimensions (width/height) and a resolution of 300 PPI.
  • All images use the same compression method, JBIG2.
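If you do not want to read the table by eye, a small awk filter can produce the same summary. This is just a sketch: the function name summarize_images is made up, and it assumes the column layout shown above (page in column 1, type in column 3, encoding in column 9). Header lines and warnings are skipped because their first field is not a number.

```shell
# Hypothetical helper: summarize `pdfimages -list` output read from stdin.
# Prints the number of images, the number of distinct pages and the encodings.
summarize_images() {
  awk '
    # Data rows start with a numeric page number and have type "image".
    $1 ~ /^[0-9]+$/ && $3 == "image" {
      n++
      pages[$1] = 1
      enc[$9] = 1
    }
    END {
      np = 0
      for (p in pages) np++
      list = ""
      # Note: awk does not guarantee the order of several encodings.
      for (e in enc) list = list (list == "" ? "" : ",") e
      printf "images=%d pages=%d encodings=%s\n", n, np, list
    }'
}
```

Example call: pdfimages -list toc.pdf 2>/dev/null | summarize_images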

These results give me confidence to suggest a different method to remove the OCR-ed text from your PDF:

  1. Extract all images.
  2. Create a new PDF from these images.

2. Extract all images from PDF

If you have one of the most recent Poppler versions of pdfimages, you are able to extract the images in the JBIG2 compression:

pdfimages -jbig2 toc.pdf toc--

The resulting image files will carry the file names toc---000.jb2e, toc---001.jb2e, ... (suffix .jb2e). Each of these files should be accompanied by another one, named toc---000.jb2g, toc---001.jb2g, ... (suffix .jb2g).

If you do not get .jb2e images as a result, but .pbm instead, you'll have to use ImageMagick's convert to create JPEGs:

for i in toc--*.pbm; do
  convert "$i" "${i/.pbm/.jpg}"
done

However, the JPEG images will be much bigger than the JBIG2 ones. (I tried it: JPEGs are in total 15 MByte, PBMs are in total 15 MBytes, JBIG2 are in total 436 kBytes for the 24 images!)

3. Create a new PDF from the extracted images

If you were unlucky and had to convert to JPEG, you can now convert these to a PDF:

convert toc--*.jpg -density 300 out.pdf

Voilà! You now have a 15 MByte PDF file without the OCR-ed text, where before you had a 1.6 MByte PDF file with OCR-ed text. (But you will not have lost much of the previous quality...)

Since my own pdfimages is compiled from source, I suffer from a bug in it from time to time. Right now it does not correctly extract images as JBIG2 files, which is why I cannot create a PDF from them either. But that PDF's size would be similar to the size of the original toc.pdf.


