Why can't 'cat' read the content of PDF files?

Solution 1

If you call cat on a file containing text in Chinese¹, it won't print out an English translation. With computer formats, it's the same thing: if you call cat on a file containing data in a certain format, it won't translate it to another format such as plain text. That's not its job: its job is to copy its input to its output without modifying it.
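To make "copy its input to its output without modifying it" concrete, here is a minimal sketch in Python of what cat does. This is not cat's real source code: the function name, the chunk size, and the sample bytes are all made up for illustration.

```python
import io

def cat(streams, out):
    """Copy each input stream to `out` byte for byte, in order (hypothetical helper)."""
    for stream in streams:
        while True:
            chunk = stream.read(65536)  # arbitrary chunk size chosen for this sketch
            if not chunk:
                break
            out.write(chunk)            # no decoding, no conversion, no interpretation

# Made-up sample data: a PDF-like header followed by bytes that are not text.
data = b"%PDF-1.4\n\xc0\xfd\xd3\xeb binary payload"

out = io.BytesIO()
cat([io.BytesIO(b"abc\n"), io.BytesIO(data)], out)
result = out.getvalue()  # the two inputs, concatenated and otherwise untouched
```

Nothing in the loop looks at what the bytes mean, which is why a PDF comes out exactly as it is stored on disk.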

A PDF file isn't a text file. A PDF file can contain text, along with formatting instructions, images, hyperlinks, etc. If you want to read the text in a PDF file, you need to use a tool that understands the PDF file format.

There are a few recognizable bits in the PDF file: NimbusRomNo9L suggests that the text is written in a Nimbus Roman font. This isn't one of the few fonts that all PDF viewers and printers must have, so it had to be embedded in the PDF file. The text itself (abc) is buried in the output and can't be read directly, because it's compressed.
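Here is a small sketch of why compression hides the text. It assumes the page content is stored with Flate (zlib) compression, which is common in PDF files but is only a guess for this particular file. The content string below is a hypothetical example of PDF drawing operators, not something extracted from 1.pdf.

```python
import zlib

# Hypothetical page content: PDF operators that draw the string "abc".
content = b"BT /F1 12 Tf 72 720 Td (abc) Tj ET"

compressed = zlib.compress(content)     # roughly what a Flate-compressed stream stores
restored = zlib.decompress(compressed)  # what a PDF-aware tool recovers before rendering

print(compressed)             # opaque bytes, like the ones cat showed
print(b"abc" in compressed)   # the literal text typically no longer appears
print(restored == content)    # decompression gives the operators back
```

A tool such as pdftotext performs this kind of decoding (and much more) before it can find the text; cat never does.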

A common tool to view files regardless of what format they're in is xdg-open. On Debian and derivatives, see is an alternative. Both work by guessing the file format from the extension of the file name and calling an appropriate application. If you want to explicitly extract the text parts (and forget about other information such as images, fonts, the location of the text on the page, etc.), you can call a program to convert the PDF file into text, such as pdftotext.

¹ If you understand Chinese, substitute Georgian, or Kannada, or Cree, or whatever language you don't speak.

Solution 2

This is because a PDF is not plain text. cat can only output the file as-is. To see the text of a PDF file on the command line, you can use pdftotext; the trailing - sends the extracted text to standard output instead of writing it to a file.

pdftotext pdffile -

Solution 3

cat(1) doesn't "print" files; it is a tool that takes a sequence of input files and concatenates them into one output.

What cat file by itself does is to take the contents of file and output it to the terminal. If the file contains text, the terminal will show it as such. PDF files aren't text, and (like any non-text) show as gibberish. Each file type needs some specific program to render it intelligibly.
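As a rough illustration of where the gibberish comes from, the sketch below decodes a few non-text bytes the way a terminal might. The byte values are chosen for the example, and the assumption that the terminal uses Latin-1 or UTF-8 is mine; the actual terminal settings in the question are unknown.

```python
# Six arbitrary bytes that are not meant to be text.
raw = b"\xc0\xfd\xd3\xeb\xf6\xfb"

# A terminal assumed to use Latin-1 maps every byte to exactly one character,
# so each byte shows up as some accented letter or symbol.
as_latin1 = raw.decode("latin-1")

# A terminal assumed to use UTF-8 finds these bytes invalid; errors="replace"
# mimics the placeholder glyph such terminals usually draw.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

Either way, the terminal is only guessing at characters for bytes that were never characters in the first place.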

Today, printers often accept PDF input directly and render it on the page; otherwise, the printing tools automatically translate it into whatever format the printer accepts.

Solution 4

You can use pdfgrep.

pdfgrep . file.pdf

Author by

Mr. Orange

Updated on September 18, 2022

Comments

  • Mr. Orange
    Mr. Orange over 1 year

    When I tried the command cat < 1.pdf, it printed a very large output, which was totally incomprehensible to me. The content of 1.pdf was abc.

    The output was like this:

    ÀýÓëöûcÎ=ÉÐÎTaüÍ8]ö¹mg:=Rú*@H1S¢▒ùá½~Ì8u_4,¬7ïy­t#¯ÚZ|åôÛ~«Æ    fM²JKÁNÿ6 ì©ìÞ¾▒bT
    ¦åÊmBíöÖ¡÷ÄïÝM{Í1¹@;ÄqÄú             t]È7DJ   Êûc0£jÜÖã­\0O8À±(2)èJR'Ø÷=~ÝƵ¡´ oÇKÈ]¹ÞÜY)ÚwÒ?[4ò©Ió¦>G)î¾J&d}ýíÜÅÓò~Ø0 $´Në¿´Èc®pVqí+ëCppG¾ùóßeõõ6GÌ,öfú8Ô7»S[¢S50c­q/_9¹jó¿·Ü%×­tQSßî▒LðbkÂÒxâ£Ö▒üVAûÇamÏ·Â׫H´+ÆWíç´upèó`I]± ÎëÚwiòtçúwAhO¼²´'Æ©ëÀ0lô?¿ÌIò▒ìXË<»ÅUepçæå¥
    SïÒFҽϷº®Ën.Z×´\£ÁEH@®2ÊçC¢n½¡hÑâ>º´¢YÚXEfg    sôë¥*|zº7>ù!I©Åÿ«;        ;&==
    )dS/),÷È´:ÞõH:CÉÑÀiTÌw!u@Âp2÷AÒfµòÜtFIZ^iÿà£ùÖ5ÐsDiërÿ$0b6Ëü~xÏ·._ÏÒõÜr²`wYù;¤²å»äE3óù²ëvÇ»Ó'ãµ~?ÿîMZÍPkh{aÙ1y&tüÙòÕMoó¬²<ñ/ÇÖa?üʯuÝÓjû,¨Üå@/GMa-èGkD}¤ð©fZbYÑlt/      ±Øj¦èRhCå1âÆñ±S@ÖòÁ~e}
    >NÀ^²Jà-Û[Mø¡FËB7ÉVy0|ôÉÏjx[ÙÁnneê)wã+ök'R6"dÞqît¿ý,ߢ]MöV>»Ñ@ÞwM0®èçã^F`çFÕ²æL((¬±S¢ÅïÂy§púÓ­Ë5y1pÆ{uxëÈOþ'¾7+Öº!í
    uV-R²f*`æ\ías\Øl^÷ ÿ`r1|yÅ-Y­Ø,º·¢▒ÀPæá¸EW0d¤q]&ÿdV6ß.cùÂ~´óðCß▒(¨îMëb#òEnÑ»PÅV½!ÀÈѵ                              c´è
    jFÇé¨J$ǵÀcu?4·[ö&å:1&OÓö(øyKxòëÑq¸çÎÇÈI#5¨çû,'µÐûfG¸Í§³UÚëÎCDøõe²Ñú$Á½é½Ocø»Éßs! ÀõE²©)8½îv¿<Üî|趻B▒ÿYw¹·ÌÞƶâôIÇ.>¾H¡n¬Éüׯ*m«¶£L£#7È?¾sÊNoXµ·àMÚ
    ?ó´ZìâþÌçùä½ÿ$qÀÊcOºùdewænår▒ÖB½dfÕ;­t4Êe3#ÄúÀ£çP=¨QÌ▒ÕþºÑ\U¼Fµ»â¯/!NZ=>½éú©,EÉ|ªQafu,5Ý%Xw%seàØÇÇTª    BZëCaßî;zÃ"Bma¤ y=ÞwÁű~ÿõåEyV/Ò%q¥Ì^Ç  2U¸âQ³1y(¾&¨òYùÆ«}üx#Á®úÅÿÆðö.i8
                  ïþ¨è|Âý6\ U+ᬮ[®eVéüvíÜ{ÈL+]¬)ùxþecä溰ÿoö?,Ä:¯Oò9T:1G4qÞ.ÌtÉÑëEæáHÔ׬¡ª                                                                                         çc^
    nÍPÑU7/ÄñcªXâ§nc]¾¨XPayÚGºxª.wÈç¤}¬ÓÏÇ\rf`¤ñ@zJnî´a'¾¨s­NÔAëG½PL6ºIQkíJÍçؼÔKýF¾)$\&§^»                                                                                            Eý¨_{tÂp¥ñT`mùPvcìÃç1ÿûKáz¹â®ò÷p×Ø?äIIö 6²¬QªMÚIµÈTã+¤i1âN¾8ɽNww²Îf¹¿kVr²ù½Ä¼Ìå±"ªúº+äÿ¥
    óv¡t5!(«:Ö+Ovl<¦aö6Kì»â2óÎåØØ|üËàÇÒ.j§·¸[ãæ¿ï`¡÷¥¾©,ÝßiÝPMåoÑéïToãw¿dyçëÀã·ó6ês\ÔR;ÕXÚ»ûÿõå▒öÁ▒¡\Ðs·~=ðÈTDÝCCijÚ`¹ÎÔ¬\·ðñ_ÿü§¯$Âõj®Û¢_]Lù¦8áÌæ²»BJÖÛn¼ûXÏjY8Ò6éØí©YóZtÛt´ÌníUè¨PGØÊzý+ÚT¦M1¥e¬åxendstreamýC~¢6A¬»hå?5µÎÍbKÏÔlwæ l▒_%L;8ê8jßQüg-í×                                                     Jâ`d¬*»ö</nä"nAíÀ ÿ]©äXĦMYS▒
    endobjÎ{°m-°õ1Hgîºû:h*µVØK°F8ñGÔÎl~V3ÄÞ!bÊcÞDGë¯×Yl(.ãâÝå`£=cü§ýÔb£ÄèMu Íëve«XîÝ£#"VØgáKÔ?öþ§®êϺݡ[3uש²Nµq÷Ú▒ßób¸l6=?'«ì>BÔ?t_Ñ  gÁ£õ=q@ÜÕÅûªE3¶L+ÕÅ©Cå}b-7Q,ì·Túlñ¨þ¦:=`î¹aÐçeÆãÜw°¥ès
    E▒ªpÇ !}¡1{¹_ZlÈë¡Á;u§·+ú,fo ä-AÏ[HM¥×▒ÌÝåìtò*9¼Â^ѧ▒aÛ`B>/Cö0Þ÷ðiNË­þÊ âÄCH´/9fVÎÉó6!vóÑ@ ðÉ!w±y;¯m$i¾äµH+·]YA|åÀD!j{øEÙ^äFÖÑ4▒ääû5þµ)Ãå*y´¹Q« 7í?NýÍ'^õ(*C4f;3ûûn³i|nIï­0uo>#n³yµ¹5§*É»&Gtê;c.9 0 objéðÜ}zÔ22T`¦E'ýX®WÈô»&Â>9=ay$àÊGWdwÂ!f·¹eMvÖ=EÞߢ¯ò^¢n`ZÜöQ!Yߧµã gÚEbØù»ÑñÓ                                                             1ªAäØÿPâ'4RÅU]xý'¬¡Â>¹æîtê3Yêy.·¬4ÖçæÍÕOß®×ñh¶ap(<</Type/Font/Subtype/Type1/BaseFont/NimbusRomNo9L-Regu                   9îî~ýÚK°ÓÑ*ÈTt÷ ØL
    /ToUnicode 8 0 R} Åta°Àj)                           _                                      Kû'Üd§éËpôKÜ~¯
    /FirstChar 0 /LastChar 255ºP!y%µRÕÖ×bðó°~®_ñA=ùjÒÜW!þy0Æ¢]ìMºõ$ÊÍD96)éàjM[îÍÙù»@y»;«!BÌaÓ;²À    ÏÞî¨ZÚ8Ýà ìÏ?å²@ÙÏû¬W$O9²ößÄ髶Âv(r·?,½ø?u«¬§ýéøZÍñÉÆSêÒfæÿ ÕÀb8ÇxØݯ¹ÅAýöµiº\ÉI$▒À}0@bâÚÕq9s'XÝ/Widths[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0®ã¥Vø![
    250 333 408 500 500 833 778 333 333 333 500 564 250 333 250 278Õ¶~~Yö*Ó}+«▒rl¥z«°       :¬Î­>2y®GmÀúÀ
    500 500 500 500 500 500 500 500 500 500 278 278 564 564 564 444
    921 722 667 667 722 611 556 722 722 333 389 722 611 889 722 722
    556 722 667 556 611 722 722 944 722 722 611 333 278 333 469 500e$<Ìßf¼péØøag#au.ÁÄè6Ý▒
    333 444 500 444 500 444 333 500 500 278 278 500 278 778 500 500
    500 500 333 389 278 500 500 722 500 500 444 480 200 480 541 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0/NimbusRomNo9L-Regu
    0 333 500 500 167 500 500 500 500 180 444 500 333 333 556 556
    0 500 500 500 250 0 453 350 333 444 444 500 1000 1000 0 444
    0 333 333 333 333 333 333 333 333 0 333 333 0 333 333 333
    1000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 889 0 276 0 0 0 0 611 722 889 310 0 0 0 0
    0 667 0 0 0 278 0 0 278 500 722 500 0 0 0 0
    

    Why can't `cat' read content of pdf files?

    • Sparhawk
      Sparhawk about 8 years
      Because it's a pdf, not plain text? cat is for plain text, e.g. outputting to a terminal emulator. How would you expect cat to output a pdf that contained images and formatting, etc.?
    • Angel Todorov
      Angel Todorov about 8 years
      @harmattan, here's a hint: ls -l -- what's the file size of that pdf file?
    • user253751
      user253751 almost 7 years
      It can. That is the content of the PDF file.
    • phuclv
      phuclv about 3 years
      catting a binary file like that may produce even worse results, since some byte sequences in the file may match ANSI escape sequences and change the terminal's behavior completely
  • Hydraxan14
    Hydraxan14 about 7 years
    Note: pdftotext is provided by the poppler-utils package in Debian 8.