PDF to XML and back to PDF again


Solution 1

The only chance of a lossless conversion from PDF to XML is to use a target XML vocabulary that takes the same view of documents as PDF does. PDF's view of documents is focused primarily, if not exclusively, on presentation, whereas XML vocabularies like DocBook are usually designed to capture higher-level abstractions. You therefore face two difficulties: (1) presentation-oriented XML vocabularies are not thick on the ground, and (2) if you want to go from PDF to a more conventional XML vocabulary (either directly or via a presentation-oriented XML), you will be pushing water uphill, trying to interpret the presentation of the document in terms of the higher-level abstractions of your target vocabulary. It will be very difficult, at best, to automate such a process.
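To make the "pushing water uphill" point concrete, here is a minimal sketch of the kind of guesswork involved. The `<page>`/`<line>` vocabulary, the attribute names, and the thresholds `body_size` and `max_gap` are all hypothetical, invented for illustration; no real tool is assumed to emit this format. The code has to infer titles and paragraphs from font size and vertical spacing alone, which is exactly the sort of heuristic that breaks on real layouts.

```python
import xml.etree.ElementTree as ET

# Hypothetical presentation-oriented XML: each <line> records only where
# text sits on the page and how large it is, not what role it plays.
PRESENTATION_XML = """
<page number="1">
  <line x="72" y="100" size="18">Round trips</line>
  <line x="72" y="130" size="10">PDF records where glyphs go,</line>
  <line x="72" y="142" size="10">not what they mean.</line>
  <line x="72" y="170" size="10">A second paragraph.</line>
</page>
"""


def guess_structure(xml_text, body_size=10.0, max_gap=14.0):
    """Guess (kind, text) blocks from positioned lines.

    Heuristics (assumptions, not rules): text larger than body_size is a
    title; consecutive lines of the same kind whose vertical distance is
    at most max_gap belong to the same block.
    """
    page = ET.fromstring(xml_text)
    blocks = []
    prev_y = None
    prev_kind = None
    for line in page.iter("line"):
        y = float(line.get("y"))
        size = float(line.get("size"))
        kind = "title" if size > body_size else "para"
        same_block = blocks and kind == prev_kind and y - prev_y <= max_gap
        if same_block:
            blocks[-1][1].append(line.text)
        else:
            blocks.append((kind, [line.text]))
        prev_y, prev_kind = y, kind
    return [(kind, " ".join(parts)) for kind, parts in blocks]


if __name__ == "__main__":
    for kind, text in guess_structure(PRESENTATION_XML):
        print(kind, "|", text)
```

Even in this toy case the result depends entirely on the two thresholds. Multi-column pages, footnotes, tables, and varying line spacing would each need further guesses.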

If this is a kind of thought experiment and you are thinking about the PDF-XML-PDF round trip to see when and how it's possible, then you now know the reasons some people will give for believing it's not possible in any general form. If you want this PDF-to-PDF data flow for some practical reason, you might want to reflect on whether your practical goals can be met in some other way.

Solution 2

If your documents are in any way like full-text articles (e.g. http://pdfx.cs.man.ac.uk/example.pdf), PDFX might be able to help.

It converts PDF articles to XML similar in structure to DocBook documents. It also tries to retain some positioning information about the extracted elements as they were found in the original PDF (e.g. page and column numbers), which could help you go from PDFX XML to the DocBook XML you already make PDFs out of.

Example input/output: http://pdfx.cs.man.ac.uk/example

Usage: http://pdfx.cs.man.ac.uk/usage
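For scripting, a call to the service might look roughly like the sketch below. It assumes that the service accepts the raw PDF bytes as an HTTP POST to `http://pdfx.cs.man.ac.uk` with a `Content-Type: application/pdf` header and returns the XML in the response body. Treat that as an assumption and check it against the usage page above. The function names are made up for this example.

```python
import urllib.request

# Assumed endpoint; confirm against the usage page before relying on it.
PDFX_URL = "http://pdfx.cs.man.ac.uk"


def build_pdfx_request(pdf_bytes, url=PDFX_URL):
    """Build (but do not send) a POST request carrying the PDF bytes."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def pdf_to_pdfx_xml(pdf_path, xml_path):
    """Send a PDF to the service and save whatever it returns.

    Assumes the response body is the XML document; no error handling
    beyond what urllib raises by itself.
    """
    with open(pdf_path, "rb") as pdf_file:
        request = build_pdfx_request(pdf_file.read())
    with urllib.request.urlopen(request) as response, open(xml_path, "wb") as out:
        out.write(response.read())
```

Calling `pdf_to_pdfx_xml("article.pdf", "article.xml")` would then write the service's response next to the input, provided the assumptions above hold.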

You might also consider TeXML, the TeX alternative to XSL-FO. I once had an old XSLT stylesheet that turned PDFX-like XML into .texml, which texml could then turn into .tex.

(Disclosure: I wrote PDFX.)

Author by

Paul Bergström

Digital archivist working mainly in Linux. Trying to develop long-term archive systems based entirely on open-source software.

Updated on August 31, 2022

Comments

  • Paul Bergström, over 1 year ago

    I recently asked a question about converting a PDF file to XML and then back to a PDF that is preferably identical to the original, or at least nearly so.

    I've been trying different methods, and so far I have come up with this one:

    1. The document written in LibreOffice is saved as DocBook XML. Say it's named "file.xml".
    2. This file is transformed with the DocBook project's XSL stylesheets, whose entry point is the file "docbook.xsl".
    3. This is done by running: xsltproc -o intermediate-fo-file.fo /usr/share/xml/docbook/stylesheet/nwalsh/fo/docbook.xsl file.xml
    4. The result is an intermediate XSL-FO file, which becomes a PDF by running: fop intermediate-fo-file.fo final.pdf
    5. This PDF file looks almost the same as the original ODT file.

    But suppose I start with a PDF file instead: how could the same thing be done? Any suggestions?
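The DocBook-to-PDF steps listed in the question can be wrapped in a small script. The sketch below reuses the two commands exactly as given there (`xsltproc` with the stylesheet path from step 3, then `fop`). The function names are invented for this example, and it assumes both tools are installed and on the PATH.

```python
import subprocess

# Stylesheet path as quoted in the question; it may differ on other systems.
DOCBOOK_FO_XSL = "/usr/share/xml/docbook/stylesheet/nwalsh/fo/docbook.xsl"


def pipeline_commands(docbook_xml, pdf_out,
                      fo_file="intermediate-fo-file.fo",
                      stylesheet=DOCBOOK_FO_XSL):
    """Return the two command lines: DocBook XML -> XSL-FO -> PDF."""
    return [
        # Step 3: apply the DocBook FO stylesheet to produce XSL-FO.
        ["xsltproc", "-o", fo_file, stylesheet, docbook_xml],
        # Step 4: render the XSL-FO file to PDF with Apache FOP.
        ["fop", fo_file, pdf_out],
    ]


def run_pipeline(docbook_xml, pdf_out):
    """Run both steps in order, stopping at the first failing command."""
    for command in pipeline_commands(docbook_xml, pdf_out):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline("file.xml", "final.pdf")
```

Keeping command construction separate from execution makes it easy to inspect or log the exact command lines before anything is run.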