An efficient way to convert documents to PDF format

python pdf ubuntu document-conversion docsplit

17,560

Solution 1

Try calling unoconv from your Python code. It took about 8 seconds on my local machine; I don't know if that's fast enough for you:

```
time unoconv 15.\ Text-Files.pptx
real    0m8.604s
```
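A minimal sketch of calling unoconv from Python, assuming unoconv is installed and on the `PATH`. The helper names `build_unoconv_command` and `convert_with_unoconv` are made up for illustration. Passing the arguments as a list avoids `shell=True`, so file names with spaces need no escaping.

```python
import subprocess
import time


def build_unoconv_command(src, dst):
    """Return the argument list that converts src to a PDF written at dst."""
    # -f selects the output format, -o the output file.
    return ["unoconv", "-f", "pdf", "-o", dst, src]


def convert_with_unoconv(src, dst, timeout=120):
    """Run unoconv and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(
        build_unoconv_command(src, dst),
        check=True,
        capture_output=True,
        timeout=timeout,
    )
    return time.monotonic() - start


# Example call (paths are placeholders):
# seconds = convert_with_unoconv("15. Text-Files.pptx", "out.pdf")
```

The timing you get will depend on your machine and on whether an office process is already running.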

Solution 2

Pandoc is a wonderful tool capable of doing what you'd like quickly. Since you're using Popen to shell out to the tool, it doesn't matter what language the tool is written in (Pandoc is written in Haskell).
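A rough sketch of shelling out to Pandoc, assuming `pandoc` is on the `PATH` and a PDF engine (typically a LaTeX distribution) is installed. To my knowledge Pandoc reads docx but not doc, ppt or pptx as input, so this would only cover part of the formats in the question. The function names are hypothetical.

```python
import subprocess


def build_pandoc_command(src, dst):
    """Return the argument list; Pandoc infers the output format from dst's extension."""
    return ["pandoc", src, "-o", dst]


def convert_with_pandoc(src, dst):
    """Run Pandoc and raise RuntimeError with its stderr if the conversion fails."""
    result = subprocess.run(
        build_pandoc_command(src, dst),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)


# Example call (paths are placeholders):
# convert_with_pandoc("report.docx", "report.pdf")
```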

Solution 3

Unfortunately I don't have the time to do a full benchmark, but you may want to check out xtopdf, my Python toolkit for PDF creation. It doesn't do the full range of conversions you want, and some of the conversions have limitations, but it may be of use. xtopdf links:

Online presentation about xtopdf - a good summary of what it is, what it does, platforms, features, users, uses etc.: http://slid.es/vasudevram/xtopdf

xtopdf on Bitbucket: https://bitbucket.org/vasudevram/xtopdf

Many blog posts showing how to use xtopdf for various purposes, including many that show how to use it to convert different input formats to PDF: http://jugad2.blogspot.com/search/label/xtopdf

HTH, Vasudev Ram


Aamir Rind


Updated on June 15, 2022

Comments

Aamir Rind almost 2 years
I have been trying to find an efficient way to convert documents (e.g. doc, docx, ppt, pptx) to PDF. So far I have tried docsplit and oowriter, but both took more than 10 seconds to convert a 1.7 MB pptx file. Can anyone suggest a better tool, or ways to improve my approach?

What I have tried:
```
from subprocess import Popen, PIPE
import time

def convert(src, dst):
    d = {'src': src, 'dst': dst}
    commands = [
        '/usr/bin/docsplit pdf --output %(dst)s %(src)s' % d,
        'oowriter --headless -convert-to pdf:writer_pdf_Export %(dst)s %(src)s' % d,
    ]

    for i in range(len(commands)):
        command = commands[i]
        st = time.time()
        process = Popen(command, stdout=PIPE, stderr=PIPE, shell=True) # I am aware of consequences of using `shell=True` 
        out, err = process.communicate()
        errcode = process.returncode
        if errcode != 0:
            raise Exception(err)
        en = time.time() - st
        print 'Command %s: Completed in %s seconds' % (str(i+1), str(round(en, 2)))

if __name__ == '__main__':
    src = '/path/to/source/file/'
    dst = '/path/to/destination/folder/'
    convert(src, dst)
```
Output:
```
Command 1: Completed in 11.91 seconds
Command 2: Completed in 11.55 seconds
```
Environment:
- Linux - Ubuntu 12.04
- Python 2.7.3
Results with other tools:
- jodconverter took 11.32 seconds
- BartoszKP over 10 years
  
  Note that this is not a real benchmark. A single result doesn't mean much. Results should be calculated as an average over many trials, and at least the standard deviation should be reported as well.
- Aamir Rind over 10 years
  
  @BartoszKP Thanks for clarification. I have chosen the wrong word.
- BartoszKP over 10 years
  
  Well, since you're interested in efficiency, "benchmark" is the right word to use, because that's the tool to measure efficiency. So your code is wrong, not words :)
- Aamir Rind over 10 years
  
  Yes you are correct :P but i was just trying to give a simple scenario to show my problem.
- BartoszKP over 10 years
  
  I understand :) But you can never be sure if anything "strange" didn't happen on your single run - like, you've received an e-mail, OS decided to swap some memory pages to disk, GC started its work - many possibilities :)
- Mark Ransom over 10 years
  
  The Microsoft and PDF formats are both very complex. 11 seconds might not be out of line.
- snozzwangler over 10 years
  
  are you trying to minimize a single run or a batch?
- janos over 10 years
  
  Does it make a difference if you run those commands in the shell instead of in Python? That is, if you run /usr/bin/docsplit pdf --output dst src without Python.
- Laur Ivan over 10 years
  
  IMHO you should try running the code several times (e.g. 20) or do it for more similar files and take an average. You might benefit from OS caching (i.e. docsplit and oowriter might remain in memory between runs).
- Aamir Rind over 10 years
  
  Actually, my aim is to run these commands from Python in a Django application. Whenever a user uploads a document that is not a PDF, I have to convert it to PDF first, so processing starts as soon as the user uploads a file.
- Aamir Rind over 10 years
  
  Also, when a user uploads a file, a scheduled Celery task is created to convert that file to PDF. So it is the single-run time that needs to improve here.
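For the repeated-trials advice in the comments above, here is a small sketch of a timing helper that reports the mean and standard deviation over several runs. The name `benchmark` and the default of 20 trials are illustrative choices, and `func` stands for whatever conversion call you want to measure.

```python
import statistics
import time


def benchmark(func, trials=20):
    """Call func() `trials` times; return (mean, stdev) of the durations in seconds."""
    if trials < 2:
        raise ValueError("need at least two trials to compute a standard deviation")
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    # statistics.stdev is the sample standard deviation.
    return statistics.mean(durations), statistics.stdev(durations)


# Example call (convert is the function from the question; paths are placeholders):
# mean, stdev = benchmark(lambda: convert("/path/to/source/file/", "/path/to/destination/folder/"))
```

Keep in mind that runs after the first may benefit from OS caching, so the first run is probably closest to what a single upload would see.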
Supreet Sethi over 10 years

Python Uno is the most reliable way to get decent PDF output from the various MS Office document types. It uses a (Star|Libre|Open)Office backend to convert the document. In principle you can do more than just convert documents; you can incorporate Basic routines as well. I would still use Uno very carefully, since office suites are known to be memory hogs. Do look through wiki.openoffice.org/wiki/PyUNO_bridge
Aamir Rind over 10 years

Thanks for your answer, I'll try it and let you know :)
Aamir Rind over 10 years

I still want it to be faster :P but I think that is the best time so far. Thanks
fatuhoku almost 8 years

The DOCX conversion on xtopdf appears to extract the text only and strips formatting. Not amazingly useful.
Vasudev Ram almost 8 years

@fatuhoku: Yes, it does just that. And that is what "some of the conversions have limitations," implies - as should be somewhat obvious if you had read my comment. I rely on libraries for most of the input format conversions, so if they have limitations, so does xtopdf in those cases. Straightforward. Also, not everything has to be "amazingly useful". Just "useful" is good enough for very many use cases - along with some tweaking with custom code or by hand, even. Happens all the time in real life.
fatuhoku almost 8 years

Hey @Vasudev didn't mean to put down your project. It's true that I didn't read your whole answer. Too late to edit my comment. With a name like xtopdf, saying that it "doesn't do the full range of conversions" is actually an understatement, which prompted my comment for posterity.
Vasudev Ram almost 8 years

No it isn't an understatement, because the x in the name stands for "solve for x" - which implies, like math equations involving x, that there may not be solutions for some values of x, or there may be, but they are not yet found - or not yet worked on :) Also, you admitted you didn't read my whole answer; and now you are changing the topic from one of those quoted phrases to another in midstream.
Vasudev Ram almost 8 years

Also, the two phrases you quoted (from my answer), occur in the SECOND sentence of my answer (not somewhere much later). So, not only did you not read my whole answer, you did not even read the second sentence before commenting on it. And I even said "it may be of use" - not "will be of use" or "amazingly useful". So you are being overly critical without doing your homework - which is common on the Internet.
Thereissoupinmyfly almost 6 years

Adding pypi.org/project/pypandoc for people still looking to do this. It removes the need to use Popen to shell out the command.
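A short sketch of what that could look like, assuming the third-party `pypandoc` package, Pandoc itself, and a PDF engine are installed. The function name `docx_to_pdf` is made up for illustration; `convert_file` is pypandoc's file-conversion entry point, and PDF output needs an `outputfile`.

```python
import os


def docx_to_pdf(src, dst):
    """Convert a docx file to PDF through pypandoc and return the output path."""
    if not os.path.isfile(src):
        raise FileNotFoundError(src)
    # Imported here so that a missing source file is reported even when
    # pypandoc is not installed (pip install pypandoc).
    import pypandoc

    pypandoc.convert_file(src, "pdf", outputfile=dst)
    return dst


# Example call (paths are placeholders):
# docx_to_pdf("report.docx", "report.pdf")
```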