How to know if a PDF contains only images or has been OCR scanned for searching?


Scanned images that were converted to PDF and OCR'ed afterwards to make the text searchable normally contain the text parts rendered as "invisible". So what you see on screen (or on paper when printed) is still the original image. But when a search succeeds, the highlighted hits are on the invisible text.

I'd recommend looking at the XPDF-derived command-line tools pdffonts(.exe), pdfinfo(.exe) and pdftotext(.exe). See here for downloads: http://www.foolabs.com/xpdf/download.html

Example usage of pdffonts:

C:\downloads\> pdffonts cisco-ip-phone-7911-guide6.1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
LGOKFL+Univers-BlackOblique          Type 1C           yes yes no   13171  0
LGOKGM+Univers-Black                 Type 1C           yes yes no   13172  0
[....]

This PDF uses fonts (indicated by the 'name' column), has them embedded (indicated by the 'yes' in the 'emb' column) and uses subset fonts (indicated by the 'yes' in the 'sub' column).

C:\downloads\> pdffonts example1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Univers-BlackOblique                 Type 1C           yes no  no   14    0
Arial                                TrueType          no  no  no   15    0

This PDF uses 2 fonts (indicated by the 'name' column). The font 'Univers-BlackOblique' is embedded completely (indicated by the 'yes' in the 'emb' column and the 'no' in the 'sub' column). The font 'Arial' is also used, but is not embedded.

C:\downloads\> pdffonts example2.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

This PDF does not use a single font, and hence does not have any text embedded (so no OCR either).
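If you want to automate this check, here is a minimal sketch in Python. It assumes pdffonts is on your PATH and that its output keeps the two header lines shown above (column names, then a dashed separator). The function names count_fonts and has_fonts are made up for this illustration. Treat a non-zero count as a hint only, since a PDF can list a font object that no page actually uses.

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows in the text printed by pdffonts."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        # The dashed separator marks the end of the table header.
        if line.startswith("----"):
            # Every non-blank line after the separator is one font.
            return sum(1 for row in lines[i + 1:] if row.strip())
    return 0  # no table found at all


def has_fonts(pdf_path: str) -> bool:
    """Run pdffonts (assumed to be installed) and report whether it lists any font."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_fonts(result.stdout) > 0
```

A batch job could call has_fonts() on each file and queue only the files that return False for OCR.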

Example usage of pdftotext:

C:\downloads\> pdftotext ^
                   -layout ^
                   cisco-ip-phone-7911-guide6.1.pdf ^
                   cisco-ip-phone-7911-guide6.1.txt

This will extract all text strings from the PDF (trying to preserve some semblance of the original layout). If there is no text in the PDF, you'd know there was no OCR...
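The same idea can be scripted. The following is a rough sketch in Python, assuming pdftotext is on your PATH; passing "-" as the output file asks pdftotext to write to stdout instead of a .txt file. The names contains_text and pdf_has_text and the min_chars threshold are assumptions for this illustration, not part of the XPDF tools.

```python
import subprocess


def contains_text(extracted: str, min_chars: int = 1) -> bool:
    """Return True if the extracted text has at least min_chars non-whitespace characters.

    pdftotext separates pages with form feeds, which count as whitespace here,
    so an image-only PDF yields no countable characters.
    """
    visible = sum(1 for ch in extracted if not ch.isspace())
    return visible >= min_chars


def pdf_has_text(pdf_path: str, min_chars: int = 1) -> bool:
    """Run pdftotext (assumed to be installed) and check its output for real text."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return contains_text(result.stdout, min_chars)
```

Raising min_chars is one way to ignore files where only a few stray characters were recognized; the right threshold depends on your documents.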


Bratch
Author by

Bratch


Updated on July 09, 2022

Comments

  • Bratch
    Bratch almost 2 years

    I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.

    I want to make an automated process to recognize the text in all of the scanned documents using OCR, with Acrobat 8 Pro, but I don't want to re-OCR the files that have already been through the OCR process in the past. Does anyone know if there is a way to tell which ones contain only images, and which ones already contain searchable text?

    I'm planning on doing this in C# or VB.NET but I don't think being able to tell the two kinds of files apart is language dependent.

  • Bratch
    Bratch over 14 years
    Can you recommend one that you know works, or that I should try?
  • Bratch
    Bratch about 14 years
    Right, in my sample sets, the image based PDFs have a blank PDF Producer, but the ones that were OCR'd show, "Adobe Acrobat 8.16 Paper Capture Plug-in." But I found another one that has selectable text and the producer is, "Acrobat Distiller 5.0.5 (Windows)." And another with text, "createpdf.adobe.com v5.1." Others with text "Microsoft Office Word 2007" and "GPL Ghostscript 8.54." It seems like the producer is blank for image based PDFs but some other value for PDFs that contain text.
  • Dangling Piyush
    Dangling Piyush almost 10 years
    I tried your approach, but for some scanned PDF files the "pdffonts" command still returns a Helvetica font. Can you explain or guide me on how I can detect this more accurately? Thanks
  • Kurt Pfeifle
    Kurt Pfeifle almost 10 years
    @DanglingPiyush: Without a sample of such a scanned PDF file I'm not able to tell you where the Helvetica comes from. Can you provide a sample page that shows this behavior?
  • Dangling Piyush
    Dangling Piyush almost 10 years
    fileconvoy.com/… This is the link to a sample PDF. It contains only scanned images, but pdffonts shows a Helvetica font. Please have a look at it.
  • Dangling Piyush
    Dangling Piyush almost 10 years
    Thanks for your time.:)
  • Dangling Piyush
    Dangling Piyush almost 10 years
    Have you looked at it?
  • Kurt Pfeifle
    Kurt Pfeifle almost 10 years
    @DanglingPiyush: This file contains a /Font object that is not really used anywhere in the file. (My theory about the cause of this is that the PDF creating software {the file's metadata calls it "Canon"} was set up to apply OCR, and this software uses Helvetica as its default OCR font, but it didn't identify any OCR-able text...)
  • Dangling Piyush
    Dangling Piyush almost 10 years
    Thanks a lot for pointing this out. Can you guide me on how I can deal with such files? And what tool did you use to extract the above-mentioned information? It will be helpful for me in the future. Thanks
  • Kurt Pfeifle
    Kurt Pfeifle almost 10 years
    @DanglingPiyush: I basically used two items: (1) A command: qpdf --qdf --object-streams=disable in order to de-compress (most) binary PDF objects and make the resulting file's PDF source code easily viewable/editable in a text editor. (2) The official PDF specification in order to understand the PDF source code.
  • Kurt Pfeifle
    Kurt Pfeifle almost 10 years
    @DanglingPiyush: You should check your scanner and its software if it provides a setting for you to disable automatic OCR of scanned pages.
  • Dangling Piyush
    Dangling Piyush almost 10 years
    Thank you so much for your guidance!!
  • debbybeginner
    debbybeginner about 4 years
    I know this is a very old post, but I now have the same question. I was wondering if you can give some pointers on how to use the command tools at the link you gave? I'm afraid I don't normally use command tools (but am keen to learn), and I can't understand the documentation at the website. I think it assumes the user already knows how to work with these tools. I work on a Mac and have used Terminal and some very basic shell commands... so any other pointers would be very helpful! Thanks