How to generate plain-text source-code PDF examples that work in a document viewer?

Solution 1

You should append a (syntactically correct) xref and trailer section to the end of the file. That means each object in your PDF needs one line in the xref table (in addition to the leading free entry for object 0), even if the byte offset isn't correctly stated. Ghostscript, pdftk or qpdf can then rebuild a correct xref table, and the file will render:

[...]
endobj
xref 
0 8 
0000000000 65535 f 
0000000010 00000 n 
0000000020 00000 n 
0000000030 00000 n 
0000000040 00000 n 
0000000050 00000 n 
0000000060 00000 n 
0000000070 00000 n 
trailer 
<</Size 8/Root 1 0 R>> 
startxref 
555 
%%EOF 
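
For files with more than a handful of objects, the placeholder table can be generated instead of typed. The following is a minimal Python sketch of that idea: the function name `placeholder_xref`, the dummy offset scheme (10, 20, ...) and the default `startxref 0` are illustrative assumptions, not part of any tool. The output is only syntactically valid, so the file still has to go through Ghostscript, pdftk or qpdf to get real offsets.

```python
def placeholder_xref(num_objects, root_obj=1, startxref=0):
    """Build a syntactically valid xref table and trailer with dummy offsets.

    num_objects counts the indirect objects in the file (1..num_objects);
    the table gets one extra entry for the mandatory free object 0.
    """
    size = num_objects + 1
    lines = ["xref", f"0 {size}", "0000000000 65535 f "]
    for i in range(1, size):
        # Dummy offsets (assumption): a repair tool is expected to replace them.
        # With the trailing space and newline each entry is exactly 20 bytes.
        lines.append(f"{i * 10:010d} 00000 n ")
    lines += [
        "trailer",
        f"<</Size {size}/Root {root_obj} 0 R>>",
        "startxref",
        str(startxref),  # placeholder value, also fixed by the repair tool
        "%%EOF",
    ]
    return ("\n".join(lines) + "\n").encode("ascii")
```

Appending `placeholder_xref(7)` to a file that ends with the seventh `endobj` gives the same table shape as the listing above.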

Solution 2

Ah, damn it: I had copied only part of the code. The code in the question is what appears on page 701 of the spec; a page footer follows it, which confused me, but the listing actually continues on page 702 :/

(EDIT: also see Introduction to PDF - GNUpdf (archive) for a similar, more detailed example)

So here is the complete code:

%PDF-1.4
1 0 obj
  << /Type /Catalog
      /Outlines 2 0 R
      /Pages 3 0 R
  >>
endobj

2 0 obj
  << /Type /Outlines
      /Count 0
  >>
endobj

3 0 obj
  << /Type /Pages
      /Kids [ 4 0 R ]
      /Count 1
  >>
endobj

4 0 obj
  << /Type /Page
      /Parent 3 0 R
      /MediaBox [ 0 0 612 792 ]
      /Contents 5 0 R
      /Resources << /ProcSet 6 0 R
      /Font << /F1 7 0 R >>
  >>
>>
endobj

5 0 obj
  << /Length 73 >>
stream
  BT
    /F1 24 Tf
    100 100 Td
    ( Hello World ) Tj
  ET
endstream
endobj

6 0 obj
  [ /PDF /Text ]
endobj

7 0 obj
  << /Type /Font
    /Subtype /Type1
    /Name /F1
    /BaseFont /Helvetica
    /Encoding /MacRomanEncoding
  >>
endobj

xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n
0000000364 00000 n
0000000466 00000 n
0000000496 00000 n

trailer
  << /Size 8
    /Root 1 0 R
  >>
startxref
625
%%EOF

Indeed, as the error messages were saying, the xref section was missing!

However, this is still not the end - while this document will open in evince, evince will still complain:

$ evince hello.pdf 
Error: PDF file is damaged - attempting to reconstruct xref table...

... and so will qpdf:

$ qpdf --check hello.pdf
WARNING: hello.pdf: file is damaged
WARNING: hello.pdf (file position 625): xref not found
WARNING: hello.pdf: Attempting to reconstruct cross-reference table
checking hello.pdf
PDF Version: 1.4
File is not encrypted
File is not linearized
WARNING: hello.pdf (object 5 0, file position 436): attempting to recover stream length
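
The "xref not found" warning can be reproduced by hand. The Python sketch below reads the number after startxref, checks that the keyword xref really sits at that byte position, and then checks that every in-use entry points at the matching "N G obj" header. The function name `check_xref` is made up for illustration, and it assumes a single classic xref subsection (no xref streams); it is a teaching aid under those assumptions, not a substitute for qpdf --check.

```python
import re


def check_xref(pdf: bytes) -> list:
    """Return a list of problems found in a classic (non-stream) xref table."""
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", pdf)
    if not m:
        return ["startxref / %%EOF not found at end of file"]
    pos = int(m.group(1))
    lines = pdf[pos:].splitlines()
    if len(lines) < 2 or lines[0].strip() != b"xref":
        return ["startxref %d does not point at 'xref'" % pos]
    # Subsection header: number of the first object, then the entry count.
    first, count = (int(x) for x in lines[1].split())
    problems = []
    for i, row in enumerate(lines[2:2 + count]):
        offset, gen, kind = row.split()
        if kind != b"n":
            continue  # free entries have no object to locate
        marker = b"%d %d obj" % (first + i, int(gen))
        if not pdf[int(offset):].startswith(marker):
            problems.append(
                "object %d: offset %d is wrong" % (first + i, int(offset))
            )
    return problems
```

Called on the bytes of hello.pdf (read with `open("hello.pdf", "rb")`), it should flag the startxref value, given that qpdf finds no xref at position 625.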

So, to actually get a proper example, the xref table needs to be reconstructed so that it has correct byte offsets, as the Adobe Forums post "Simple Text String Example in specification broken." points out.

And in order to do this, we can use pdftk to "Repair a PDF's Corrupted XREF Table and Stream Lengths (If Possible)":

$ pdftk hello.pdf output hello_repair.pdf

... and now hello_repair.pdf opens in evince without a problem - and qpdf reports:

$ qpdf --check hello_repair.pdf
checking hello_repair.pdf
PDF Version: 1.4
File is not encrypted
File is not linearized
No errors found
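
If the goal is a hand-readable example rather than a repaired one, another option is to generate the file so that the offsets and the stream length are measured instead of typed in. The Python sketch below does that for a trimmed-down Hello World. The helper names `build_pdf` and `stream_object` are hypothetical, and dropping the /Outlines, /ProcSet, /Name and /Encoding entries is a simplification of this sketch, not the spec's listing.

```python
def stream_object(data: bytes) -> bytes:
    """Wrap stream data with a /Length that matches its real byte count."""
    # The newline before "endstream" is not counted in /Length.
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"


def build_pdf(objects) -> bytes:
    """Serialize object bodies and append an xref table with measured offsets."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position of "<num> 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)  # the value that goes after startxref
    size = len(objects) + 1
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"  # object 0 is always the free entry
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


content = b"BT\n/F1 24 Tf\n100 100 Td\n(Hello World) Tj\nET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [ 0 0 612 792 ]\n"
    b"   /Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    stream_object(content),
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
]
pdf = build_pdf(objects)
```

Writing `pdf` to disk in binary mode (`open("hello_fixed.pdf", "wb")`) keeps the byte counts intact; text mode could translate line endings and shift the offsets again.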

Well, hope this helps someone,
Cheers!

Author: sdaau

Updated on June 30, 2022

Comments

  • sdaau
    sdaau almost 2 years

    I just found the Adobe Forums post "Simple Text String Example in specification broken.", which got me interested in finding plain-text source-code PDF examples.

    So, through that post, I eventually found:

    The PDF 1.7 spec has, on page 699, the appendix "Annex H (informative) Example PDF files"; from there, I wanted to try "H.3 Simple Text String Example" (the "classic Hello World").

    So I tried to save this as hello.pdf (note that when you copy from PDF32000_2008.pdf, you may get "%PDF-1. 4", that is, a space inserted after "1.", which must be removed):

    %PDF-1.4
    1 0 obj
      << /Type /Catalog
          /Outlines 2 0 R
          /Pages 3 0 R
      >>
    endobj
    
    2 0 obj
      << /Type /Outlines
          /Count 0
      >>
    endobj
    
    3 0 obj
      << /Type /Pages
          /Kids [ 4 0 R ]
          /Count 1
      >>
    endobj
    
    4 0 obj
      << /Type /Page
          /Parent 3 0 R
          /MediaBox [ 0 0 612 792 ]
          /Contents 5 0 R
          /Resources << /ProcSet 6 0 R
          /Font << /F1 7 0 R >>
      >>
    >>
    endobj
    
    5 0 obj
      << /Length 73 >>
    stream
      BT
        /F1 24 Tf
        100 100 Td
        ( Hello World ) Tj
      ET
    endstream
    endobj
    

    ... and I'm trying to open it:

    evince hello.pdf
    

    ... however, evince cannot open it: "Unable to open document / PDF document is damaged"; and also:

    Error: PDF file is damaged - attempting to reconstruct xref table...
    Error: Couldn't find trailer dictionary
    Error: Couldn't read xref table
    

    I also check with qpdf:

    $ qpdf --check hello.pdf
    WARNING: hello.pdf: file is damaged
    WARNING: hello.pdf: can't find startxref
    WARNING: hello.pdf: Attempting to reconstruct cross-reference table
    hello.pdf: unable to find trailer dictionary while recovering damaged file
    

    Where am I going wrong with this?

    Many thanks in advance for any answers,
    Cheers!

  • sdaau
    sdaau almost 12 years
    Indeed - thanks for that @pipitas; I had also realized the same and documented in this post; cheers!
  • yms
    yms about 9 years
    @KurtPfeifle I have found that for simple PDF files (with no compressed streams, for example), just using consecutive object ids, skipping the xref table and ending the file with startxref 0 %%EOF works like a charm: Acrobat Reader opens the file and generates a new table.
  • Kurt Pfeifle
    Kurt Pfeifle about 9 years
    @yms: Yes, but the "saved" PDF may contain code that is and looks completely different from your original PDF source code. So it is not an option for cases where you want to write PDF files that serve as learning/teaching/studying material.