An efficient way to convert documents to PDF format


Solution 1

Try calling unoconv from your Python code. It took about 8.6 seconds on my local machine; I don't know whether that's fast enough for you:

time unoconv 15.\ Text-Files.pptx
real    0m8.604s
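
A minimal sketch of calling unoconv from Python, assuming unoconv is installed and on the PATH. The helper names `build_unoconv_command` and `convert_with_unoconv`, and the 120-second timeout, are hypothetical choices for illustration and not part of the original answer. It relies only on unoconv's `-f` (output format) and `-o` (output file or directory) options.

```python
import subprocess


def build_unoconv_command(src, output):
    """Return the argument list for converting src to PDF.

    `output` may be a directory or a file name, as accepted by unoconv's -o.
    """
    # Passing a list (instead of shell=True) avoids quoting problems with
    # file names that contain spaces, such as "15. Text-Files.pptx".
    return ["unoconv", "-f", "pdf", "-o", output, src]


def convert_with_unoconv(src, output, timeout=120):
    """Run unoconv and raise RuntimeError if it exits with a non-zero status."""
    result = subprocess.run(
        build_unoconv_command(src, output),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # assumed limit; tune it for your documents
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result
```

Usage would look like `convert_with_unoconv("15. Text-Files.pptx", "/path/to/destination/folder/")`. Whether this beats the timings in the question depends on the machine and the document, so it is worth measuring rather than assuming.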

Solution 2

Pandoc is a wonderful tool capable of doing what you'd like quickly. Since you're using Popen to shell out to the tool, it doesn't matter what language the tool is written in (Pandoc is written in Haskell).
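
A rough sketch of shelling out to Pandoc with Popen, in the same style as the question's code. It assumes `pandoc` is on the PATH and that a PDF engine (for example a LaTeX distribution) is installed, since Pandoc needs one to write PDF. The function names are hypothetical. Input and output format support differ in Pandoc, so it is worth running `pandoc --list-input-formats` first to confirm that your source format is readable.

```python
from subprocess import Popen, PIPE


def build_pandoc_command(src, dst_pdf):
    """Return the argument list for converting src to the PDF file dst_pdf."""
    # pandoc infers PDF output from the .pdf extension of the -o target.
    return ["pandoc", src, "-o", dst_pdf]


def convert_with_pandoc(src, dst_pdf):
    """Run pandoc and raise RuntimeError if it exits with a non-zero status."""
    process = Popen(build_pandoc_command(src, dst_pdf), stdout=PIPE, stderr=PIPE)
    out, err = process.communicate()
    if process.returncode != 0:
        raise RuntimeError(err.decode(errors="replace"))
    return dst_pdf
```

For example, `convert_with_pandoc("report.docx", "report.pdf")`, where both file names are placeholders.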

Solution 3

Unfortunately I don't have the time to do a full benchmark, but you may want to check out xtopdf, my Python toolkit for PDF creation. It doesn't do the full range of conversions you want, and some of the conversions have limitations, but it may be of use. xtopdf links:

Online presentation about xtopdf - a good summary of what it is, what it does, platforms, features, users, uses etc.: http://slid.es/vasudevram/xtopdf

xtopdf on Bitbucket: https://bitbucket.org/vasudevram/xtopdf

Many blog posts showing how to use xtopdf for various purposes, including many that show how to use it to convert different input formats to PDF: http://jugad2.blogspot.com/search/label/xtopdf

HTH, Vasudev Ram


Author: Aamir Rind


Updated on June 15, 2022

Comments

  • Aamir Rind
    Aamir Rind almost 2 years

    I have been trying to find an efficient way to convert documents (e.g. doc, docx, ppt, pptx) to PDF. So far I have tried docsplit and oowriter, but both took more than 10 seconds to complete the job on a 1.7 MB pptx file. Can anyone suggest a better way, or improvements to my approach?

    What I have tried:

    from subprocess import Popen, PIPE
    import time
    
    def convert(src, dst):
        d = {'src': src, 'dst': dst}
        commands = [
            '/usr/bin/docsplit pdf --output %(dst)s %(src)s' % d,
            'oowriter --headless -convert-to pdf:writer_pdf_Export %(dst)s %(src)s' % d,
        ]
    
        for i in range(len(commands)):
            command = commands[i]
            st = time.time()
            process = Popen(command, stdout=PIPE, stderr=PIPE, shell=True) # I am aware of consequences of using `shell=True` 
            out, err = process.communicate()
            errcode = process.returncode
            if errcode != 0:
                raise Exception(err)
            en = time.time() - st
            print 'Command %s: Completed in %s seconds' % (str(i+1), str(round(en, 2)))
    
    if __name__ == '__main__':
        src = '/path/to/source/file/'
        dst = '/path/to/destination/folder/'
        convert(src, dst)
    

    Output:

    Command 1: Completed in 11.91 seconds
    Command 2: Completed in 11.55 seconds
    

    Environment:

    • Linux - Ubuntu 12.04
    • Python 2.7.3

    More tools result:

    • BartoszKP
      BartoszKP over 10 years
      Note that this is not a real benchmark. A single result doesn't tell you much. Results should be calculated as an average over many trials, and at least the standard deviation should be reported as well.
    • Aamir Rind
      Aamir Rind over 10 years
      @BartoszKP Thanks for the clarification. I chose the wrong word.
    • BartoszKP
      BartoszKP over 10 years
      Well, since you're interested in efficiency, "benchmark" is the right word to use, because that's the tool to measure efficiency. So your code is wrong, not words :)
    • Aamir Rind
      Aamir Rind over 10 years
      Yes, you are correct :P but I was just trying to give a simple scenario to show my problem.
    • BartoszKP
      BartoszKP over 10 years
      I understand :) But you can never be sure that nothing "strange" happened during your single run - for example, you received an e-mail, the OS decided to swap some memory pages to disk, or the GC started its work - there are many possibilities :)
    • Mark Ransom
      Mark Ransom over 10 years
      The Microsoft and PDF formats are both very complex. 11 seconds might not be out of line.
    • snozzwangler
      snozzwangler over 10 years
      Are you trying to minimize the time of a single run or of a batch?
    • janos
      janos over 10 years
      Does it make a difference if you run those commands in the shell instead of in Python? That is, if you run /usr/bin/docsplit pdf --output dst src without Python.
    • Laur Ivan
      Laur Ivan over 10 years
      IMHO you should try running the code several times (e.g. 20) or do it for more similar files and take an average. You might benefit from OS caching (i.e. docsplit and oowriter might remain in memory between runs).
    • Aamir Rind
      Aamir Rind over 10 years
      Actually, my aim is to run these commands from Python and use them in a Django application. Whenever a user uploads a document that is not a PDF, I have to convert it to PDF first. So processing is done as soon as the user uploads a file.
    • Aamir Rind
      Aamir Rind over 10 years
      Also, when a user uploads a file, a scheduled Celery task is created to convert that file to PDF. So it is the time of a single run that needs to be improved here.
  • Supreet Sethi
    Supreet Sethi over 10 years
    Python Uno is the most reliable way to get decent PDF output from various MS Office document types. It uses the (Star|Libre|Open)Office backend to convert documents. In principle you can do more than just convert documents; you can incorporate basic routines as well. I would still use Uno very carefully, since office software is known to be a memory hog. Do look through wiki.openoffice.org/wiki/PyUNO_bridge
  • Aamir Rind
    Aamir Rind over 10 years
    Thanks for your answer, I'll try it and let you know :)
  • Aamir Rind
    Aamir Rind over 10 years
    I still want it to be faster :P but I think that is the best time so far. Thanks.
  • fatuhoku
    fatuhoku almost 8 years
    The DOCX conversion on xtopdf appears to extract the text only and strips formatting. Not amazingly useful.
  • Vasudev Ram
    Vasudev Ram almost 8 years
    @fatuhoku: Yes, it does just that. And that is what "some of the conversions have limitations," implies - as should be somewhat obvious if you had read my comment. I rely on libraries for most of the input format conversions, so if they have limitations, so does xtopdf in those cases. Straightforward. Also, not everything has to be "amazingly useful". Just "useful" is good enough for very many use cases - along with some tweaking with custom code or by hand, even. Happens all the time in real life.
  • fatuhoku
    fatuhoku almost 8 years
    Hey @Vasudev didn't mean to put down your project. It's true that I didn't read your whole answer. Too late to edit my comment. With a name like xtopdf, saying that it "doesn't do the full range of conversions" is actually an understatement, which prompted my comment for posterity.
  • Vasudev Ram
    Vasudev Ram almost 8 years
    No it isn't an understatement, because the x in the name stands for "solve for x" - which implies, like math equations involving x, that there may not be solutions for some values of x, or there may be, but they are not yet found - or not yet worked on :) Also, you admitted you didn't read my whole answer; and now you are changing the topic from one of those quoted phrases to another in midstream.
  • Vasudev Ram
    Vasudev Ram almost 8 years
    Also, the two phrases you quoted (from my answer), occur in the SECOND sentence of my answer (not somewhere much later). So, not only did you not read my whole answer, you did not even read the second sentence before commenting on it. And I even said "it may be of use" - not "will be of use" or "amazingly useful". So you are being overly critical without doing your homework - which is common on the Internet.
  • Thereissoupinmyfly
    Thereissoupinmyfly almost 6 years
    Adding pypi.org/project/pypandoc for people still looking to do this. It removes the need to use Popen to shell out the command.
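
Following the comments above that a single timing is not a benchmark, here is a small sketch of timing a conversion command over several runs and reporting the mean and standard deviation. The function name `time_command` and the default of five runs are hypothetical choices. The command timed in the `__main__` block is a do-nothing placeholder so that the sketch runs anywhere; replace it with your real converter invocation.

```python
import statistics
import subprocess
import sys
import time


def time_command(command, runs=5):
    """Run command `runs` times; return (mean, stdev) of wall-clock seconds."""
    if runs < 2:
        raise ValueError("need at least 2 runs to compute a standard deviation")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


if __name__ == "__main__":
    # Placeholder command; swap in the real conversion, for example
    # ["unoconv", "-f", "pdf", "input.pptx"] (the file name is hypothetical).
    mean, stdev = time_command([sys.executable, "-c", "pass"], runs=5)
    print("mean %.3fs, stdev %.3fs over 5 runs" % (mean, stdev))
```

The first run is often slower than later ones because of OS caching, so it can be informative to look at the individual timings as well as the summary.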