An efficient way to convert documents to PDF format
Solution 1
Try calling unoconv from your Python code. It took about 8.6 seconds on my local machine; I don't know whether that is fast enough for you:
time unoconv 15.\ Text-Files.pptx
real 0m8.604s
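A minimal sketch of calling unoconv from Python, assuming Python 3 and that the `unoconv` executable is on the PATH. The function names `build_unoconv_command` and `convert_with_unoconv` and the 60-second timeout are hypothetical choices for illustration, not part of unoconv itself:

```python
import subprocess


def build_unoconv_command(src, dst):
    """Build the argument list: -f selects the output format, -o the output file."""
    return ["unoconv", "-f", "pdf", "-o", dst, src]


def convert_with_unoconv(src, dst, timeout=60):
    """Run unoconv; raises CalledProcessError on a non-zero exit code."""
    # Passing a list (no shell=True) avoids quoting problems with file names
    # that contain spaces, such as "15. Text-Files.pptx".
    subprocess.run(build_unoconv_command(src, dst), check=True, timeout=timeout)
```

Usage would look like `convert_with_unoconv("15. Text-Files.pptx", "15. Text-Files.pdf")`.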
Solution 2
Pandoc is a wonderful tool capable of doing what you'd like quickly. Since you're using Popen to effectively shell out the command for the tool, it doesn't matter what language the tool is written in (Pandoc is written in Haskell).
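A minimal sketch of shelling out to Pandoc, assuming Python 3 and a Pandoc installation that includes a PDF engine (Pandoc produces PDF through an external engine such as LaTeX). As far as I know, Pandoc reads docx but not doc, ppt, or pptx as input, so this may cover only part of the question. The function names are hypothetical:

```python
import shutil
import subprocess


def build_pandoc_command(src, dst):
    """Build the argument list; Pandoc infers the output format from the .pdf extension."""
    return ["pandoc", src, "-o", dst]


def convert_with_pandoc(src, dst):
    """Run Pandoc; raises if the executable is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    subprocess.run(build_pandoc_command(src, dst), check=True)
```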
Solution 3
Unfortunately I don't have the time to do a full benchmark, but you may want to check out xtopdf, my Python toolkit for PDF creation. It doesn't do the full range of conversions you want, and some of the conversions have limitations, but it may be of use. xtopdf links:
Online presentation about xtopdf - a good summary of what it is, what it does, platforms, features, users, uses etc.: http://slid.es/vasudevram/xtopdf
xtopdf on Bitbucket: https://bitbucket.org/vasudevram/xtopdf
Many blog posts showing how to use xtopdf for various purposes, including many that show how to use it to convert different input formats to PDF: http://jugad2.blogspot.com/search/label/xtopdf
HTH, Vasudev Ram
Aamir Rind
Updated on June 15, 2022
Comments
-
Aamir Rind almost 2 years
I have been trying to find an efficient way to convert documents (e.g. doc, docx, ppt, pptx) to PDF. So far I have tried docsplit and oowriter, but both took more than 10 seconds to complete the job on a 1.7 MB pptx file. Can anyone suggest a better way, or improvements to my approach?
What I have tried:
from subprocess import Popen, PIPE
import time

def convert(src, dst):
    d = {'src': src, 'dst': dst}
    commands = [
        '/usr/bin/docsplit pdf --output %(dst)s %(src)s' % d,
        'oowriter --headless -convert-to pdf:writer_pdf_Export %(dst)s %(src)s' % d,
    ]
    for i in range(len(commands)):
        command = commands[i]
        st = time.time()
        process = Popen(command, stdout=PIPE, stderr=PIPE, shell=True)  # I am aware of consequences of using `shell=True`
        out, err = process.communicate()
        errcode = process.returncode
        if errcode != 0:
            raise Exception(err)
        en = time.time() - st
        print 'Command %s: Completed in %s seconds' % (str(i+1), str(round(en, 2)))

if __name__ == '__main__':
    src = '/path/to/source/file/'
    dst = '/path/to/destination/folder/'
    convert(src, dst)
Output:
Command 1: Completed in 11.91 seconds
Command 2: Completed in 11.55 seconds
Environment:
- Linux - Ubuntu 12.04
- Python 2.7.3
Results from other tools:
- jodconverter took 11.32 seconds
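Single-run timings like those above can be skewed by caching or background activity. A minimal sketch of repeating a command and reporting the mean and standard deviation, assuming Python 3; the function names `time_command` and `summarize` and the default of 20 runs are hypothetical choices for illustration:

```python
import statistics
import subprocess
import time


def time_command(command, runs=20):
    """Run `command` (a list of arguments) `runs` times; return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the tool's output so only the run time is measured.
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation); the deviation is 0.0 for a single run."""
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, stdev
```

For example, `summarize(time_command(["/usr/bin/docsplit", "pdf", "--output", dst, src]))` would time the first command from the question, with `src` and `dst` set as in the original script.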
-
BartoszKP over 10 years
Note that this is not a real benchmark. A single result is not meaningful. Results should be calculated as an average over many trials, and at least the standard deviation should be reported as well.
-
Aamir Rind over 10 years
@BartoszKP Thanks for the clarification. I chose the wrong word.
-
BartoszKP over 10 years
Well, since you're interested in efficiency, "benchmark" is the right word to use, because that's the tool to measure efficiency. So your code is wrong, not your words :)
-
Aamir Rind over 10 years
Yes, you are correct :P but I was just trying to give a simple scenario to show my problem.
-
BartoszKP over 10 years
I understand :) But you can never be sure that nothing "strange" happened during your single run: you received an e-mail, the OS decided to swap some memory pages to disk, the GC started its work. There are many possibilities :)
-
Mark Ransom over 10 years
The Microsoft and PDF formats are both very complex. 11 seconds might not be out of line.
-
snozzwangler over 10 years
Are you trying to minimize the time of a single run or of a batch?
-
janos over 10 years
Does it make a difference if you run those commands in the shell instead of in Python? That is, if you run /usr/bin/docsplit pdf --output dst src without Python.
-
Laur Ivan over 10 years
IMHO you should run the code several times (e.g. 20), or run it on several similar files, and take an average. You might benefit from OS caching (i.e. docsplit and oowriter might remain in memory between runs).
-
Aamir Rind over 10 years
Actually, my aim is to run these commands from Python in a Django application. Whenever a user uploads a document that is not a PDF, I have to convert it to PDF first, so processing starts as soon as the user uploads a file.
-
Aamir Rind over 10 years
Also, when a user uploads a file, a scheduled Celery task is created to convert that file to PDF. So it is the single-run time that needs to be improved here.
-
Supreet Sethi over 10 years
Python Uno is the most reliable way to get decent PDF output from various MS Office document types. It uses the (Star|Libre|Open)Office backend to convert documents. In principle you can do more than just convert documents; you can incorporate Basic routines as well. I would still use Uno very carefully, since office suites are known to be memory hogs. Do look through wiki.openoffice.org/wiki/PyUNO_bridge
-
Aamir Rind over 10 years
Thanks for your answer, I'll try it and let you know :)
-
Aamir Rind over 10 years
I still want it faster :P but I think that is the best time so far. Thanks.
-
fatuhoku almost 8 years
The DOCX conversion in xtopdf appears to extract the text only and strip the formatting. Not amazingly useful.
-
Vasudev Ram almost 8 years
@fatuhoku: Yes, it does just that. And that is what "some of the conversions have limitations" implies, as should be somewhat obvious if you had read my comment. I rely on libraries for most of the input format conversions, so if they have limitations, so does xtopdf in those cases. Straightforward. Also, not everything has to be "amazingly useful". Just "useful" is good enough for very many use cases, along with some tweaking with custom code or by hand, even. Happens all the time in real life.
-
fatuhoku almost 8 years
Hey @Vasudev, I didn't mean to put down your project. It's true that I didn't read your whole answer. Too late to edit my comment. With a name like xtopdf, saying that it "doesn't do the full range of conversions" is actually an understatement, which prompted my comment for posterity.
-
Vasudev Ram almost 8 years
No, it isn't an understatement, because the x in the name stands for "solve for x", which implies, like math equations involving x, that there may not be solutions for some values of x, or there may be, but they are not yet found or not yet worked on :) Also, you admitted you didn't read my whole answer, and now you are changing the topic from one of those quoted phrases to another in midstream.
-
Vasudev Ram almost 8 years
Also, the two phrases you quoted (from my answer) occur in the SECOND sentence of my answer (not somewhere much later). So, not only did you not read my whole answer, you did not even read the second sentence before commenting on it. And I even said "it may be of use", not "will be of use" or "amazingly useful". So you are being overly critical without doing your homework, which is common on the Internet.
-
Thereissoupinmyfly almost 6 years
Adding pypi.org/project/pypandoc for people still looking to do this. It removes the need to use Popen to shell out the command.
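A minimal sketch of the pypandoc route mentioned above, assuming Python 3, that the third-party pypandoc package is installed alongside a Pandoc installation with a PDF engine, and that the input is a format Pandoc can read (such as docx). The helper names `pdf_path_for` and `convert_with_pypandoc` are hypothetical:

```python
from pathlib import Path


def pdf_path_for(src, out_dir):
    """Return the PDF path inside out_dir that corresponds to the source file name."""
    return Path(out_dir) / (Path(src).stem + ".pdf")


def convert_with_pypandoc(src, out_dir):
    """Convert src to PDF with pypandoc and return the path of the output file."""
    # Imported here so that pdf_path_for stays usable without pypandoc installed.
    import pypandoc

    dst = pdf_path_for(src, out_dir)
    # For PDF output, pypandoc needs an explicit outputfile.
    pypandoc.convert_file(str(src), "pdf", outputfile=str(dst))
    return dst
```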