How can I convert a scanned PDF with OCRed text to one without OCRed text?


Solution 1

Here is how I would remove the OCR-ed text should I have to...

First, you need to know that OCR-ed text in a PDF is not a layer, but a special text rendering mode. (The official PDF specification contains a table listing all available text rendering modes; mode 3 renders text neither filled nor stroked, i.e. invisible.)

For more background, please see these answers of mine on StackOverflow:


Now for the procedure I envisage:

0. Make a backup of your original PDF file

'nuff said...

1. Use qpdf to un-compress most of the PDF objects

qpdf is a beautiful command line tool that transforms most PDFs into a form which is much easier to manipulate with a text editor (or with sed):

qpdf                       \
  --qdf                    \
  --object-streams=disable \
    input.pdf              \
    editable.pdf

2. Search for spots where PDF code contains 3 Tr

All spots in editable.pdf where there is 'invisible' (i.e. neither filled nor stroked) text are marked by a preceding setting of

3 Tr

Change these to now read

1 Tr

This should make the previously hidden text visible. Glyphs will appear in thick outlines, overlaying the original scanned page images.

It will look very ugly.

Save the edited PDF.
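
If you prefer not to make this change by hand in a text editor, sed can do it for you. This is only a sketch: it assumes GNU sed and that the byte sequence 3 Tr occurs nowhere in the file except in these text state settings (the replacement keeps the byte count identical):

# LC_ALL=C makes sed treat the uncompressed PDF as raw bytes
LC_ALL=C sed -i 's/3 Tr/1 Tr/g' editable.pdf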

3. Change Tj and TJ text stroking operators to 'no-ops'

Whenever a text string is prepared for being rendered, the actual operator that is responsible for doing so is named Tj or TJ.

Look out for all of these. Replace Tj by tj and TJ by tJ. This turns them into 'no-ops': they have no meaning at all in PDF source code, so no PDF viewer or processor will "understand" them. (Be careful not to change the number of bytes when replacing things in PDF source code; otherwise the byte offsets stored in the cross-reference table no longer match and the file becomes "corrupted".)

Save the PDF file.
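
Again, this can be scripted instead of done by hand. A minimal sed sketch, assuming GNU sed; both substitutions keep the byte count unchanged:

# rename the text-showing operators to meaningless ones
LC_ALL=C sed -i -e 's/Tj/tj/g' -e 's/TJ/tJ/g' editable.pdf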

4. Check how the PDF file looks now

The PDF should now look "clean" again: the renamed text operators no longer have any meaning for a PDF viewer or any other PDF interpreter.
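
A quick way to verify this (a sketch, assuming Poppler's pdftotext is installed): text extraction should now return essentially nothing, because no recognized text-showing operators are left:

pdftotext editable.pdf - | head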

5. Use Ghostscript to create the final PDF

This command should achieve what you want:

gs                        \
  -o final.pdf            \
  -sDEVICE=pdfwrite       \
  -dPDFSETTINGS=/prepress \
   editable.pdf

This final step uses editable.pdf as input and outputs final.pdf. The output will have all traces of text removed. The input still contained the text, albeit in an "unusable" form, because of the operator renaming. Since Ghostscript does not "understand" the renamed operators, it simply skips them by default.
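
For convenience, the whole procedure can be put into a small script (steps 1, 3 and 5; step 2 only makes the hidden text visible for checking and is skipped here). This is a sketch under the same assumptions as above, namely GNU sed and the operator names occurring only where expected, so inspect the result before discarding anything:

#!/bin/sh
# 1. un-compress the PDF objects
qpdf --qdf --object-streams=disable input.pdf editable.pdf
# 3. turn the text-showing operators into no-ops (byte count unchanged)
LC_ALL=C sed -i -e 's/Tj/tj/g' -e 's/TJ/tJ/g' editable.pdf
# 5. let Ghostscript drop the unknown operators and rebuild a clean PDF
gs -o final.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress editable.pdf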

Solution 2

There are multiple ways to get rid of the OCRed text in the file.

  1. Export the scanned images from the PDF and recombine them. You can use pdfimages (from the poppler-utils package) for the extraction and convert (from ImageMagick) to turn them back into a PDF (a resolution-preserving variant is sketched after this list):

    pdfimages toc.pdf toctmp
    convert toctmp*.pbm newtoc.pdf
    
  2. Print to PDF (with PDF support from cups-pdf)
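
For option 1, if you want the rebuilt PDF to keep the original page size, tell convert what resolution the extracted bitmaps have (PBM files carry no resolution information, so ImageMagick would otherwise assume 72 PPI). A sketch, assuming the pages were scanned at 300 PPI:

pdfimages toc.pdf toctmp
# declare the density before reading, so the PDF pages keep their physical size
convert -density 300 -units PixelsPerInch toctmp*.pbm newtoc.pdf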

PDF is a horrible format for scanned images, but it is often used because it can hold multiple pages in one file. The storage format inside the PDF, however, is often JPEG, which is inappropriate for scans. Recovering the original images from the PDF is probably not possible (there is no such thing as "the original scanned PDF file"), because making the PDF from the scanned images is usually the quality-reducing step after scanning. You can try to get the images out of the PDF with pdfimages (or render the pages with pdftoppm), but OCR software that works on images in PDFs already knows how to get the best (and only) quality out of them, so there is unlikely to be anything you can do to improve on that.

The problem probably lies with your scanning software, not with the OCR software. If you still have the original material, scan it once more to a multipage TIFF (LZW-compressed); that gives much better OCR results than anything that was converted to a PDF containing JPEG images.
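
A sketch of such a re-scan workflow, assuming a SANE-supported scanner driven by scanimage (with a backend that accepts these options) and libtiff's tiffcp; resolution and file names are placeholders to adapt:

# scan each page into its own TIFF file ...
scanimage --batch=page_%03d.tiff --format=tiff --resolution 300
# ... then merge them into one LZW-compressed multipage TIFF for the OCR software
tiffcp -c lzw page_*.tiff book.tiff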

Solution 3

When I tried to access the link to your sample scanned file earlier, it didn't work for me. Meanwhile I have downloaded it and taken a closer look.

1. Using pdfimages -list to investigate the embedded images

If you run a recent (!) version of the Poppler variant of pdfimages, the -list parameter is available. It prints a useful list of the images contained in your PDF file. The most recent versions also report additional information (such as image resolution and compression ratio) that was not so easily available before.

Unfortunately, your PDF file contains some syntax errors, which give this garbled output:

kp@mbp:#175536> pdfimages -l 1 -list toc.pdf
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 Syntax Warning: Couldn't link the profiles
 Syntax Warning: Can't create transform
 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image  2000  2650  icc     1   1  jbig2  no       51 0   300   300 12.4K 1.9%

So let's redirect <stderr> output to /dev/null and try again:

kp@mbp:#175536> pdfimages -list toc.pdf 2>/dev/null
page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
   1   0 image  2000  2650  icc     1   1  jbig2  no       51 0   300   300 12.4K 1.9%
   2   1 image  2012  2659  icc     1   1  jbig2  no      616 0   300   301 16.1K 2.5%
   3   2 image  2014  2661  icc     1   1  jbig2  no      696 0   301   300 16.0K 2.4%
   4   3 image  2000  2650  icc     1   1  jbig2  no      778 0   300   300 16.2K 2.5%
   5   4 image  2000  2650  icc     1   1  jbig2  no      855 0   300   300 16.2K 2.5%
   6   5 image  2000  2650  icc     1   1  jbig2  no      938 0   300   300 15.7K 2.4%
   7   6 image  2000  2650  icc     1   1  jbig2  no     1026 0   300   300 15.5K 2.4%
   8   7 image  2022  2667  icc     1   1  jbig2  no     1103 0   300   300 15.7K 2.4%
   9   8 image  2000  2650  icc     1   1  jbig2  no     1190 0   300   300 15.5K 2.4%
  10   9 image  2011  2658  icc     1   1  jbig2  no     1271 0   300   301 15.7K 2.4%
  11  10 image  2000  2650  icc     1   1  jbig2  no     1347 0   300   300 15.7K 2.4%
  12  11 image  2010  2657  icc     1   1  jbig2  no     1429 0   300   300 15.5K 2.4%
  13  12 image  2000  2650  icc     1   1  jbig2  no     1504 0   300   300 16.8K 2.6%
  14  13 image  2000  2650  icc     1   1  jbig2  no     1589 0   300   300 15.4K 2.4%
  15  14 image  2000  2650  icc     1   1  jbig2  no     1666 0   300   300 17.6K 2.7%
  16  15 image  2010  2657  icc     1   1  jbig2  no     1740 0   300   300 18.7K 2.9%
  17  16 image  2006  2654  icc     1   1  jbig2  no     1823 0   300   301 17.7K 2.7%
  18  17 image  2007  2656  icc     1   1  jbig2  no     1905 0   300   300 16.9K 2.6%
  19  18 image  2000  2650  icc     1   1  jbig2  no     1983 0   300   300 16.7K 2.6%
  20  19 image  2000  2650  icc     1   1  jbig2  no     2065 0   300   300 17.4K 2.7%
  21  20 image  2000  2650  icc     1   1  jbig2  no     2148 0   300   300 17.4K 2.7%
  22  21 image  2011  2658  icc     1   1  jbig2  no     2229 0   300   301 17.2K 2.6%
  23  22 image  2006  2654  icc     1   1  jbig2  no     2305 0   300   301 17.5K 2.7%
  24  23 image  2000  2650  icc     1   1  jbig2  no     2377 0   300   300 14.5K 2.2%

This output means:

  • 24 images (numbered 0-23) on 24 pages (one image per page).
  • All images have very similar dimensions (width/height) and a resolution of 300 PPI.
  • All images use the same compression method, JBIG2.

These results give me confidence to suggest a different method for removing the OCR-ed text from your PDF:

  1. Extract all images.
  2. Create a new PDF from these images.

2. Extract all images from PDF

If you have one of the most recent Poppler versions of pdfimages, you can extract the images in their original JBIG2 compression:

pdfimages -jbig2 toc.pdf toc--

The resulting image files will carry the file names toc---000.jb2e, toc---001.jb2e, ... (suffix .jb2e). Each of these files should be accompanied by another one named toc---000.jb2g, toc---001.jb2g, ... (suffix .jb2g).

If you do not get .jb2e images as a result, but .pbm instead, you'll have to use ImageMagick's convert to create JPEGs:

# convert each extracted PBM into a JPEG with the same basename
for i in toc--*.pbm; do
  convert "$i" "${i/.pbm/.jpg}"
done

However, the JPEG images will be much bigger than the JBIG2 ones. (I tried it: JPEGs are in total 15 MByte, PBMs are in total 15 MBytes, JBIG2 are in total 436 kBytes for the 24 images!)

3. Create a new PDF from the extracted images

If you were unlucky and had to convert to JPEG, you can now turn these into a PDF (setting the density to the original 300 PPI so that the page size is preserved):

convert -density 300 toc--*.jpg out.pdf

Voilà! You now have a 15 MByte PDF file without the OCR-ed text, where before you had a 1.6 MByte PDF file with OCR-ed text. (But you will not have lost much of the previous quality...)


Since my own pdfimages is compiled from sources, I occasionally run into bugs with it. Right now it does not correctly extract images as JBIG2 files, which is why I cannot create a PDF from them either. But such a PDF's size would be similar to the original toc.pdf's size...
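
As a workaround (also suggested in the comments below), the extracted bitmaps can be re-compressed with agl's jbig2enc encoder, and the pdf.py script that ships with it can assemble a PDF from the resulting JBIG2 streams. A sketch, assuming jbig2enc and its pdf.py are installed and that pdfimages gave you PBM files:

# encode all pages as symbol-mode JBIG2 with shared globals (default basename "output")
jbig2 -s -p -v toc--*.pbm
# build a PDF from the generated output.* files
python pdf.py output > final.pdf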


Updated on September 18, 2022

Comments

  • Tim
    Tim almost 2 years

    I have a scanned PDF file, with low-quality OCRed text.

    I would like to have a PDF file without the OCRed text.

    How can I convert a scanned PDF with OCRed text to one without OCRed text?

    I am wondering what methods can recover the original scanned PDF file (as it was before OCR) as much as possible, without changing the width and height of each page in pixels, and without changing the pixels per inch of each page.

    Would some kind of re-rasterization help? Will rasterizing again lose image quality?

    Several attempts:

    1. I used Print to File in Evince, which I think uses cups-pdf; it doesn't remove the OCRed text.
    2. The following gs command doesn't remove the OCRed text either (I think I haven't found out how to use gs properly):

      gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
         -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf toc.pdf
      
    • Nathaniel M. Beaver
      Nathaniel M. Beaver over 4 years
      Your example PDF file is a 404 error.
  • Tim
    Tim over 9 years
    Thanks. (1) I used Print to File in Evince, which I think uses cups-pdf; it doesn't remove the OCRed text. (2) But can your methods recover the original scanned pdf file (as it was before OCR) as much as possible, without changing the width and height of each page in pixels, and without changing the PPI or DPI (pixels per inch) of each page?
  • Anthon
    Anthon over 9 years
    @Tim Evince probably tries to be smart and includes the OCR-ed text; I did not know it could do that.
  • Tim
    Tim over 9 years
    I don't have a scanner, and was only given the scanned pdf file with OCRed text (see the link to the file in my first sentence, and my attempt at the end of my post).
  • Anthon
    Anthon over 9 years
    @Tim assuming that you want to OCR the scans again, can't you use the PDF as input and tell the software to redo the scan (and try harder)?
  • Tim
    Tim over 9 years
    (1) The OCR software I have doesn't remove existing OCRed text, and re-OCRing will result in both old and new OCRed text. (2) On Windows, Adobe Acrobat Pro doesn't re-OCR a pdf file that already has OCRed text.
  • Anthon
    Anthon over 9 years
    Use pdfimages and convert then, I updated the answer (point 1)
  • Tim
    Tim over 9 years
    Do the pdfimages and convert commands keep the number of pixels along the width and height unchanged, and keep the DPI unchanged?
  • Anthon
    Anthon over 9 years
    The page size in points changes but not the number of pixels; the result doesn't influence display in e.g. evince. Why don't you just try? It takes less than 10 seconds to do the conversion back and forth.
  • Tim
    Tim over 9 years
    Thanks. Just verified that. What if a pdf page contains more than one image? Will pdfimages output each image as a separate file? How can we rasterize each pdf page to a single image, regardless of how many images are on the page?
  • Anthon
    Anthon over 9 years
    @tim if this comes from a scan, or directly from LaTeX via DVI, it will not have multiple images per page. AFAIK pdfimages has no tricks to keep multiple images of a page together, if a file should have that.
  • Tim
    Tim over 9 years
    Yes, a pdf file with multiple images on one page doesn't come from a scan, but it can come from LaTeX (using \includegraphics multiple times to include several images will do the trick). For such a pdf file, is there a way to rasterize each page while losing as little quality as possible?
  • Anthon
    Anthon over 9 years
    @tim have a look at dvi2bitmap; it completely rasterizes the DVI file
  • Tim
    Tim over 9 years
    Thanks. But what if a pdf file instead of a dvi file is given?
  • Anthon
    Anthon over 9 years
    @tim Then you use pdftoppm -tiff -tiffcompression lzw; that doesn't extract the images, it renders the pages. You can set the DPI with -r.
  • Anthon
    Anthon over 9 years
    @tim but OCR-ing that afterwards is not so smart, as the text in the original PDF would be much more easily extracted with pdftotext
  • Tim
    Tim over 9 years
    I ran pdfimages ullman.pdf tmp/ullman to convert a pdf file with more than 1000 pages to pbm files. A problem is that pdfimages only uses 3 digits for the page numbers when naming the image files, e.g. ullman-099.pbm ullman-1000.pbm ullman-1001.pbm ullman-1002.pbm ullman-1003.pbm ullman-1004.pbm ullman-1005.pbm ullman-1006.pbm ullman-1007.pbm ullman-1008.pbm ullman-1009.pbm ullman-100.pbm ullman-1010.pbm. When running convert ullman*.pbm new.pdf, the pages in new.pdf are disordered. Is there any way to ask pdfimages to use 4 digits for the page numbers when naming the pbm image files?
  • Anthon
    Anthon over 9 years
    @tim no. Why don't you just do convert ullman-???.pbm ullman-????.pbm new.pdf? Please start a new question if you have more questions; this no longer has anything to do with the original problem.
  • Tim
    Tim over 9 years
    Thanks, I see. Is this problem related or not: I used pdfimages and convert on the pdf file with more than 1000 pages, and the size of the pdf file changed from 29MB to 373MB. I ran pdftk ullman.pdf output new.pdf compress, and it doesn't reduce the file size. Is it possible to reduce the size in the middle of the conversion, or is it better to do it afterwards?
  • user755506
    user755506 over 9 years
    Anthon, how come you rate PDF as "a horrible format for scanned images"? Which one is better in your opinion? DjVu? If so, what about its practical value, given how widely appropriate viewers are available?
  • user755506
    user755506 over 9 years
    @Anthon: Avoiding answering my question?!? How come you can accumulate a reputation of 24k this way? I repeat: what, in your opinion, is a better format for scanned images than PDF? (I do not want to learn about biking to South Africa, nor about proprietary viewers...)
  • Tim
    Tim over 9 years
    Thanks. (1) What does the final step actually do? Does it undo some previous step? (2) Can your method remove OCRed text from any pdf file? (3) Can a pdf file processed by your method be OCRed again by Adobe Acrobat (which can't OCR a pdf file that has already been OCRed)?
  • user755506
    user755506 over 9 years
    @Tim: (2) This method can be used to remove OCRed text from any PDF file. -- (3) Yes, the resulting PDF file can be OCRed again by Adobe Acrobat.
  • user755506
    user755506 over 9 years
    @Tim: GS requires no special option to remove the unusable text. Since it doesn't understand it, it simply skips these sections.
  • Tim
    Tim over 9 years
    (1) Does "The output will have removed all traces of text. The input still had the text, it was just unusable" mean the command by gs removes the unusable text? (3) Is it good to keep unusable text in a pdf file? If not, how can I remove the unusable text?
  • user755506
    user755506 over 9 years
    @Tim: Dear Tim, why do you think I wrote down step no. 5 at all, if not to achieve exactly that: removing the unusable text?!? sigh
  • Tim
    Tim over 9 years
    " Since it doesn't understand it, it simply skips these sections." makes me think gs doesn't remove unusable text
  • user755506
    user755506 over 9 years
    @Tim: Why don't you actually test what I outlined?! If Ghostscript skips these sections while reading the input file, how do you think its output could possibly still contain the unusable text?!
  • porg
    porg almost 8 years
    Can anyone provide a piped one liner command line combining all the steps described?
  • labreuer
    labreuer over 7 years
    Just FYI, one could convert pbm files to png (or run a Poppler version of pdfimages with -png), then use agl/jbig2enc (generates jbig2 with globals), then use pdf.py (in that project) to create a pdf. I know this works if the pdf is made up exclusively of jbig2 images, one per page.
  • user755506
    user755506 over 7 years
    @labreuer: Just FYI, going the PNG route does not offer any advantages IMHO. If it does, please explain to me: which? Because PNG typically is larger than JPEG, so the disadvantages I clearly outlined (file size of new PDF sans OCR) would be even worse...
  • labreuer
    labreuer over 7 years
    False in this case: when I switched your code to png, the resultant pdf I got was 1.97MB. You'd probably be right if we weren't dealing with bitonal images of text; png compresses those quite well. But it's also irrelevant, because I was only using png as an intermediary to jbig2. I knew I could do this, because your pdfimages -list results showed that all the images were jbig2.
  • user755506
    user755506 over 7 years
    @labreuer: Interesting. Thanks for checking this. I'll investigate this some more (and probably update my answer, giving you credit) once I find the time to do it.
  • labreuer
    labreuer over 7 years
    Sure; it was quite easy, as I'm in the process of recompressing various PDFs into lossy jbig2 with globals using iTextSharp. When I applied that to the above pdf generated from pngs, I got a 264KB pdf. (For others who happen along, I may open source the resultant C# project at some point in the future. I have to decide how deeply to dive into understanding pdfs.)
  • vstepaniuk
    vstepaniuk over 4 years
    @KurtPfeifle so Step 2 is necessary only to see the results visually?
  • vstepaniuk
    vstepaniuk over 4 years
    @KurtPfeifle is gs -o output.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf the same?
  • user755506
    user755506 about 4 years
    @vstepaniuk: No. See here: stackoverflow.com/a/38010769/359307
  • vstepaniuk
    vstepaniuk about 4 years
    @KurtPfeifle Yes, I saw that - I am not as stupid as you think, Kurt Pfeifle... Now, do you think invisible OCR-ed text can be a vector or an image?
  • user755506
    user755506 about 4 years
    @vstepaniuk: Look, you asked. See, I answered. Where did I say I think you're stupid??
  • user755506
    user755506 about 4 years
    @vstepaniuk: If it is "text" in PDF, it uses a font (which may be composed of vectors or pixel images -- but in the end it is text). If it is "vector", it is not text. Your brain MAY interpret it as text once it is rendered by the viewer and seen by your eyes, but a PDF text extraction tool cannot do that; it can only handle fonts.
  • user755506
    user755506 about 4 years
    @vstepaniuk: OCR-ed text also uses fonts. If text is invisible, this invisibility is brought about by using a special PDF text rendering mode (3), see here: stackoverflow.com/a/11418938/359307
  • user755506
    user755506 about 4 years
    @vstepaniuk: To answer more exhaustively: Using Ghostscript with -dFILTERTEXT will remove all text with the related fonts. Changing the PDF code first to 1 Tr does not remove the text, it only makes it visible. Which in turn means that tools like pdftotext will still be able to extract it (likewise copy'n'paste from a PDF viewer). Transforming the Tj and TJ operators into no-ops prevents those tools from extracting the text, but does not prevent a human PDF hacker from restoring it by reversing the edits to the PDF code.