How to generate plain-text source-code PDF examples that work in a document viewer?

linux pdf command-line pdf-generation

10,543

Solution 1

You should append a (syntactically correct) xref and trailer section to the end of the file. That means: each object in your PDF needs one line in the xref table, even if the byte offset isn't correctly stated. Then Ghostscript, pdftk or qpdf can re-establish a correct xref and render the file:

```
[...]
endobj
xref 
0 8 
0000000000 65535 f 
0000000010 00000 n 
0000000020 00000 n 
0000000030 00000 n 
0000000040 00000 n 
0000000050 00000 n 
0000000060 00000 n 
0000000070 00000 n 
trailer 
<</Size 8/Root 1 0 R>> 
startxref 
555 
%%EOF
```
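A placeholder table like this can also be generated instead of typed by hand. What follows is only a minimal Python sketch, under some assumptions: objects are numbered consecutively from 1, each starts its own line as `N 0 obj` with LF line endings, and object 1 is the catalog. The function name `append_placeholder_xref` is hypothetical. The offsets and the `startxref` value are deliberately dummy values, so the result still has to be repaired by Ghostscript, pdftk or qpdf.

```python
import re

def append_placeholder_xref(body: bytes, root_obj: int = 1) -> bytes:
    """Append a syntactically valid (but not byte-accurate) xref and trailer."""
    # Count object headers such as b"7 0 obj" at the start of a line.
    count = len(re.findall(rb"^\d+ 0 obj\b", body, flags=re.MULTILINE))
    size = count + 1  # entry 0 is the head of the free list

    lines = ["xref", f"0 {size}", "0000000000 65535 f "]
    # Dummy offsets: one entry per object, NOT the real byte positions.
    lines += [f"{10 * n:010d} 00000 n " for n in range(1, size)]
    lines += ["trailer", f"<</Size {size}/Root {root_obj} 0 R>>",
              "startxref", "0",  # placeholder value, also not accurate
              "%%EOF", ""]
    if not body.endswith(b"\n"):
        body += b"\n"
    return body + "\n".join(lines).encode("ascii")
```

Usage would be along the lines of `open("hello_xref.pdf", "wb").write(append_placeholder_xref(open("hello.pdf", "rb").read()))`, where both file names are just examples.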

Solution 2

Ah damn it - I had copied only part of the code. The code in the question is the part on page 701; a page footer follows it, which confused me, but the code continues on page 702 :/

(EDIT: also see Introduction to PDF - GNUpdf (archive) for a similar, more detailed example)

So here is the complete code:

```
%PDF-1.4
1 0 obj
  << /Type /Catalog
      /Outlines 2 0 R
      /Pages 3 0 R
  >>
endobj

2 0 obj
  << /Type /Outlines
      /Count 0
  >>
endobj

3 0 obj
  << /Type /Pages
      /Kids [ 4 0 R ]
      /Count 1
  >>
endobj

4 0 obj
  << /Type /Page
      /Parent 3 0 R
      /MediaBox [ 0 0 612 792 ]
      /Contents 5 0 R
      /Resources << /ProcSet 6 0 R
      /Font << /F1 7 0 R >>
  >>
>>
endobj

5 0 obj
  << /Length 73 >>
stream
  BT
    /F1 24 Tf
    100 100 Td
    ( Hello World ) Tj
  ET
endstream
endobj

6 0 obj
  [ /PDF /Text ]
endobj

7 0 obj
  << /Type /Font
    /Subtype /Type1
    /Name /F1
    /BaseFont /Helvetica
    /Encoding /MacRomanEncoding
  >>
endobj

xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n
0000000364 00000 n
0000000466 00000 n
0000000496 00000 n

trailer
  << /Size 8
    /Root 1 0 R
  >>
startxref
625
%%EOF
```

Indeed, as the error messages were saying, the xref section was missing!

However, this is still not the end - while this document will open in evince, evince will still complain:

```
$ evince hello.pdf
Error: PDF file is damaged - attempting to reconstruct xref table...
```

... and so will qpdf:

```
$ qpdf --check hello.pdf
WARNING: hello.pdf: file is damaged
WARNING: hello.pdf (file position 625): xref not found
WARNING: hello.pdf: Attempting to reconstruct cross-reference table
checking hello.pdf
PDF Version: 1.4
File is not encrypted
File is not linearized
WARNING: hello.pdf (object 5 0, file position 436): attempting to recover stream length
```
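The "xref not found" warning means that the number after `startxref` is not the byte offset of the `xref` keyword. That check, and the matching check for each table entry, can be sketched in Python. This is a rough illustration only, assuming a classic single-section xref table with no xref streams or incremental updates. The name `check_xref` and the message strings are made up for this sketch and are not qpdf's.

```python
import re

def check_xref(pdf: bytes) -> list[str]:
    """List problems in a classic, single-section xref table (illustrative only)."""
    found = re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf)
    if not found:
        return ["can't find startxref"]
    problems = []
    pos = int(found.group(1))
    # startxref must hold the byte offset of the "xref" keyword itself.
    if not pdf.startswith(b"xref", pos):
        problems.append(f"file position {pos}: xref not found")

    table = re.search(rb"^xref\s+(\d+)\s+\d+\s+(.*?)trailer", pdf,
                      flags=re.MULTILINE | re.DOTALL)
    if not table:
        return problems + ["xref table not found"]
    first = int(table.group(1))  # object number of the first entry
    entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", table.group(2))
    for num, (offset, gen, kind) in enumerate(entries, start=first):
        if kind != b"n":
            continue  # free entries do not point at an object
        header = b"%d %d obj" % (num, int(gen))
        # An in-use entry must point at the first byte of "N G obj".
        if not pdf.startswith(header, int(offset)):
            problems.append(f"object {num} {int(gen)}: not at offset {int(offset)}")
    return problems
```

Calling `check_xref(open("hello.pdf", "rb").read())` returns an empty list when every offset lines up.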

So to actually get a proper example, as the Adobe Forums post "Simple Text String Example in specification broken" points out, the xref table needs to be reconstructed so that it has correct byte offsets.
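For a hand-written teaching file, the offsets can also be recomputed without changing the objects themselves. Here is a minimal Python sketch, not what any of the repair tools actually do. It assumes uncompressed objects numbered consecutively from 1 with generation 0, each starting a line as `N 0 obj`, LF line endings, and object 1 as the catalog. The name `rebuild_xref` is hypothetical. It leaves `/Length` values alone, so a checker may still report a stream length mismatch.

```python
import re

def rebuild_xref(pdf: bytes, root_obj: int = 1) -> bytes:
    """Rewrite xref, trailer and startxref of a simple PDF with real offsets."""
    # Drop any old table: keep everything before a line that is just "xref".
    old = re.search(rb"^xref\s*$", pdf, flags=re.MULTILINE)
    body = pdf[: old.start()] if old else pdf
    if not body.endswith(b"\n"):
        body += b"\n"

    # Record the byte offset of every "N 0 obj" header.
    offsets = {}
    for m in re.finditer(rb"^(\d+) 0 obj\b", body, flags=re.MULTILINE):
        offsets[int(m.group(1))] = m.start()
    size = max(offsets) + 1  # entry 0 is the free-list head

    # Each entry is exactly 20 bytes: 10 digits, 5 digits, n/f, 2-byte EOL.
    entries = [b"0000000000 65535 f \n"]
    for num in range(1, size):
        entries.append(b"%010d 00000 n \n" % offsets[num])

    xref_pos = len(body)  # the new table starts right after the body
    tail = b"xref\n0 %d\n" % size + b"".join(entries)
    tail += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_obj)
    tail += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return body + tail
```

Because only the tail of the file is rewritten, the object source stays readable exactly as typed, which matters if the file is meant as study material.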

And in order to do this, we can use pdftk to "Repair a PDF's Corrupted XREF Table and Stream Lengths (If Possible)":

```
$ pdftk hello.pdf output hello_repair.pdf
```

... and now hello_repair.pdf opens in evince without a problem - and qpdf reports:

```
$ qpdf --check hello_repair.pdf
checking hello_repair.pdf
PDF Version: 1.4
File is not encrypted
File is not linearized
No errors found
```
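The stream length that qpdf had to recover in the unrepaired file can be measured by hand too. A small Python sketch follows, assuming each stream dictionary ends with a direct `/Length N >>` (not an indirect reference) and that the data is everything between the line break after `stream` and the line break before `endstream`, which is not counted in the length. The helper name `stream_lengths` is hypothetical.

```python
import re

def stream_lengths(pdf: bytes) -> list[tuple[int, int]]:
    """Return (declared, measured) lengths for each stream with a direct /Length."""
    pattern = re.compile(
        rb"/Length\s+(\d+)\s*>>\s*stream\r?\n"  # dictionary end, keyword, EOL
        rb"(.*?)"                               # the stream data itself
        rb"\r?\n?endstream",                    # optional EOL, not counted
        re.DOTALL,
    )
    return [(int(m.group(1)), len(m.group(2))) for m in pattern.finditer(pdf)]
```

Any pair whose two numbers differ is a stream whose `/Length` needs correcting. The measured value depends on the exact indentation and line endings the file was saved with.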

Well, hope this helps someone,
Cheers!


Author by

sdaau

Updated on June 30, 2022

Comments

sdaau almost 2 years
I just found the Adobe Forums post "Simple Text String Example in specification broken", so I got interested in finding plain-text source-code PDF examples.

So, through that post, I eventually found:
- The webpage "PDF Reference and Adobe Extensions to the PDF Specification | Adobe Developer Connection", which contains:
  - Document Management – Portable Document Format – Part 1: PDF 1.7, First Edition (PDF32000_2008.pdf)

On page 699, the PDF 1.7 spec has the appendix "Annex H (informative) Example PDF files"; from there, I wanted to try "H.3 Simple Text String Example" (the classic "Hello World").

So I tried to save this as hello.pdf (note that when you copy from PDF32000_2008.pdf, you may get "%PDF-1. 4", that is, a space inserted after "1.", which must be removed):
```
%PDF-1.4
1 0 obj
  << /Type /Catalog
      /Outlines 2 0 R
      /Pages 3 0 R
  >>
endobj

2 0 obj
  << /Type /Outlines
      /Count 0
  >>
endobj

3 0 obj
  << /Type /Pages
      /Kids [ 4 0 R ]
      /Count 1
  >>
endobj

4 0 obj
  << /Type /Page
      /Parent 3 0 R
      /MediaBox [ 0 0 612 792 ]
      /Contents 5 0 R
      /Resources << /ProcSet 6 0 R
      /Font << /F1 7 0 R >>
  >>
>>
endobj

5 0 obj
  << /Length 73 >>
stream
  BT
    /F1 24 Tf
    100 100 Td
    ( Hello World ) Tj
  ET
endstream
endobj
```
... and I'm trying to open it:
```
evince hello.pdf
```
... however, evince cannot open it: "Unable to open document / PDF document is damaged"; and also:
```
Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
```
I also checked with qpdf:
```
$ qpdf --check hello.pdf
WARNING: hello.pdf: file is damaged
WARNING: hello.pdf: can't find startxref
WARNING: hello.pdf: Attempting to reconstruct cross-reference table
hello.pdf: unable to find trailer dictionary while recovering damaged file
```
Where am I going wrong with this?

Many thanks in advance for any answers,
Cheers!
sdaau almost 12 years

Indeed - thanks for that @pipitas; I had also realized the same and documented it in this post; cheers!
yms about 9 years

@KurtPfeifle I have found that for simple PDF files (with no compressed streams, for example), just using consecutive object ids, skipping the xref table and putting startxref 0 %%EOF at the end works like a charm: Acrobat Reader opens the file and generates a new table.
Kurt Pfeifle about 9 years

@yms: Yes, but the "saved" PDF may contain code that is and looks completely different from your original PDF source code. So it is not an option for cases where you want to write PDF files that serve as learning/teaching/studying material.