.doc to pdf using python
Solution 1
A simple example using comtypes that converts a single file, with the input and output filenames given as command-line arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')
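Since the question is about converting many files, the single-file example could be extended roughly as follows. This is only a sketch: the names pdf_path_for and convert_folder are hypothetical, it assumes Windows with Word and comtypes installed, and it reuses one Word instance for every file while making sure Word is closed even if a conversion fails.

```python
from pathlib import Path

WD_FORMAT_PDF = 17  # same value as wdFormatPDF in the example above


def pdf_path_for(doc_path, out_dir):
    """Return the .pdf path inside out_dir that corresponds to doc_path."""
    return Path(out_dir) / (Path(doc_path).stem + ".pdf")


def convert_folder(src_dir, out_dir):
    """Convert every .doc/.docx in src_dir with one Word instance (Windows only)."""
    import comtypes.client  # imported lazily so pdf_path_for works without comtypes

    out_dir = Path(out_dir).resolve()  # Word needs absolute paths
    out_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(
        p for p in Path(src_dir).resolve().iterdir()
        if p.suffix.lower() in (".doc", ".docx")
        and not p.name.startswith("~$")  # skip Word's temporary lock files
    )
    word = comtypes.client.CreateObject("Word.Application")
    try:
        for src in sources:
            doc = word.Documents.Open(str(src))
            try:
                doc.SaveAs(str(pdf_path_for(src, out_dir)), FileFormat=WD_FORMAT_PDF)
            finally:
                doc.Close()  # close the document even if SaveAs fails
    finally:
        word.Quit()  # always shut Word down, otherwise WINWORD.EXE lingers
```

Calling convert_folder with a source folder and an output folder would then write one PDF per Word document; the folder names are up to you.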
Solution 2
You can use the docx2pdf Python package to bulk convert docx to PDF. It can be used as both a CLI and a Python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
Solution 3
I worked on this problem for half a day, so I want to share what I learned. Steven's answer is right, but it fails on my computer. Two key points fixed it:
(1) After creating the 'Word.Application' object for the first time, I have to make the Word app visible before opening any documents. (I cannot explain why this works. If I skip this on my computer, the program crashes when I try to open a document in invisible mode, and the 'Word.Application' object is then deleted by the OS.)
(2) After doing (1), the program works sometimes but still fails often. The error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM server may not be able to respond quickly enough, so I add a delay before trying to open a document.
With these two steps, the program runs reliably on my machine. The demo code is below. If you encounter the same problems, try these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of output pdf file 2'
# print out filenames
print(in_file)
print(out_file)
print(in_file2)
print(out_file2)
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make Word visible before opening a document
word.Visible = True
# key point 2: give the COM server time to get ready
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
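As an alternative to a fixed time.sleep(3), the "Call was rejected by callee" error could be handled by retrying the call a few times. The helper below is a generic sketch, not part of comtypes; the name call_with_retry and its parameters are made up for illustration.

```python
import time


def call_with_retry(func, retries=5, delay=1.0, retry_on=(Exception,)):
    """Call func(); if it raises one of retry_on, wait `delay` seconds and retry."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: let the original error surface
            time.sleep(delay)
```

With the demo above, this would presumably be used as doc = call_with_retry(lambda: word.Documents.Open(in_file), retry_on=(comtypes.COMError,)), so the script only waits when the COM server is actually busy.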
Solution 4
I have tested many solutions, but none of them worked well on Linux distributions.
I recommend this solution:
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    # the output path is parsed from the "-> <path> using filter" message on stdout
    filename = re.search('-> (.*?) using filter', process.stdout.decode())
    if filename is None:
        raise RuntimeError('conversion failed: ' + process.stderr.decode())
    return filename.group(1)


def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
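A variation on the same idea, as a sketch only: instead of parsing LibreOffice's stdout, the output path can be derived from the input name, on the assumption that LibreOffice writes <input stem>.pdf into the --outdir folder. The function names build_command, expected_pdf and convert_checked are hypothetical; the command-line flags are the ones used above.

```python
import subprocess
from pathlib import Path


def build_command(executable, outdir, source):
    """Build the headless LibreOffice command line used in the answer above."""
    return [executable, "--headless", "--convert-to", "pdf",
            "--outdir", str(outdir), str(source)]


def expected_pdf(outdir, source):
    """Path where the PDF is expected: same file stem as the source, in outdir."""
    return Path(outdir) / (Path(source).stem + ".pdf")


def convert_checked(executable, outdir, source, timeout=None):
    """Run the conversion and raise if no PDF was produced."""
    result = subprocess.run(
        build_command(executable, outdir, source),
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout,
    )
    pdf = expected_pdf(outdir, source)
    if result.returncode != 0 or not pdf.exists():
        raise RuntimeError("conversion failed: " + result.stderr.decode(errors="replace"))
    return pdf
```

Checking that the file exists means a failed conversion is reported as an error rather than returning a path that is not there.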
Solution 5
As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(
    OutputFileName=pdf_file,
    ExportFormat=17,        # 17 = PDF output, 18 = XPS output
    OpenAfterExport=False,
    OptimizeFor=0,          # 0 = Print (higher res), 1 = Screen (lower res)
    CreateBookmarks=1,      # 0 = No bookmarks, 1 = Heading bookmarks only, 2 = Bookmarks match Word bookmarks
    DocStructureTags=True,
)
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
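To avoid magic numbers, the options could be collected in a small helper. This is a sketch with a hypothetical function name (pdf_export_options); the numeric values are the ones listed in the comments above, and the constant names follow Word's WdExportFormat, WdExportOptimizeFor and WdExportCreateBookmarks enumerations.

```python
# Values from Word's export enumerations (the same numbers as in the call above)
WD_EXPORT_FORMAT_PDF = 17
WD_EXPORT_OPTIMIZE_FOR_PRINT = 0
WD_EXPORT_OPTIMIZE_FOR_ONSCREEN = 1
BOOKMARK_MODES = {
    "none": 0,      # wdExportCreateNoBookmarks
    "headings": 1,  # wdExportCreateHeadingBookmarks
    "word": 2,      # wdExportCreateWordBookmarks
}


def pdf_export_options(pdf_file, bookmarks="headings", for_screen=False):
    """Build keyword arguments for doc.ExportAsFixedFormat."""
    return {
        "OutputFileName": str(pdf_file),
        "ExportFormat": WD_EXPORT_FORMAT_PDF,
        "OpenAfterExport": False,
        "OptimizeFor": (WD_EXPORT_OPTIMIZE_FOR_ONSCREEN if for_screen
                        else WD_EXPORT_OPTIMIZE_FOR_PRINT),
        "CreateBookmarks": BOOKMARK_MODES[bookmarks],
        "DocStructureTags": True,
    }
```

The call then becomes doc.ExportAsFixedFormat(**pdf_export_options(pdf_file, bookmarks="word")), which should carry bookmarks defined in the Word document over to the PDF.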
nik
Updated on July 05, 2022
Comments
-
nik almost 2 years
I am tasked with converting tons of .doc files to .pdf, and the only way my supervisor wants me to do this is through MS Word 2010. I know I should be able to automate this with Python COM automation. The only problem is that I don't know how or where to start. I tried searching for tutorials but could not find any (maybe I did, but I don't know what I'm looking for).
Right now I'm reading through this. Don't know how useful this is going to be.
-
nik about 13 years
I am a Linux/Unix user and more inclined towards Python. But the ps script looks pretty simple and exactly what I was looking for. Thanks :)
-
nik almost 13 years
This is exactly what I was looking for. Thanks :)
-
ecoe about 10 years
For many files, consider setting word.Visible = False to save time and processing of the Word files (MS Word will not display this way; the code essentially runs in the background).
-
Snorfalorpagus about 9 years
I've managed to get this working for PowerPoint documents. Use Powerpoint.Application, Presentations.Open and FileFormat=32.
-
slaveCoder over 7 years
I am using a Linux server and these libraries don't work on Linux. Is there any other way to make it work on Linux?
-
Aman Gautam about 7 years
8 years into development and it's this answer that made me feel bad that I am not on Windows!
-
user3732708 about 7 years
When I run this, I get an error:
File "test.py", line 7, in <module>
in_file = os.path.abspath(sys.argv[1])
IndexError: list index out of range
-
Peter Wood almost 7 years
@user3732708 argv[1] and argv[2] will be the names of the input and output files. You get that error if you don't specify the files on the command line.
-
asetniop almost 6 years
When running the doc.SaveAs() command I got an error and had to drop the "FileFormat=" prefix, and then it worked fine.
-
Todd over 4 years
Just used your package to print my .docx file. It worked like a charm! Couldn't have been simpler to use. Great job!
-
Al Johri over 4 years
Thanks @Todd! Give the repo a star when you get a chance.
-
Al Johri about 4 years
@Abdelhedihlel Unfortunately, it requires Microsoft Office to be installed and thus only works on Windows and macOS.
-
abdelhedi hlel about 4 years
@AlJohri take a look here: michalzalecki.com/converting-docx-to-pdf-using-python. This solution works on both Windows and Linux. Running on Linux is a must because most deployment servers use Linux.
-
Basj about 4 years
Can you include sample code showing how to do it from a Python script (import unoconv, unoconv.dosomething(...))? The documentation only shows how to do it from the command line.
-
Basj about 4 years
Do you have an example working with LibreOffice? word = comtypes.client.CreateObject('LibreWriter.Application') doesn't work.
-
Vishesh Mangla almost 4 years
Is there any way to just use file objects and avoid these file saves?
-
Vishesh Mangla almost 4 years
Is there a method to convert a Word file object to PDF in your module?
-
rain about 3 years
Will it help to preserve bookmarks that were created in the docx file?
-
Att Righ about 2 years
"Please note that there is a rewrite of Unoconv called "Unoserver": github.com/unoconv/unoserver We are running Unoserver successfully in production, and it’s now the recommended solution. Unoserver does not have all the features of Unoconv, which features it will get depends on a combination of what people want, and if someone wants to implement it. Until Unoserver has all the major features people need, Unoconv is in bugfix mode, there will be no major changes...." from github.com/unoconv/unoconv. I think I'm placing my money on unoconv still.
-
Att Righ about 2 years
Heads up for other users: I had some issues making unoconv work. The approach I went for (which works okay on Linux and within Docker) was calling libreoffice directly as described in this answer.
-
not2qubit about 2 years
This is not using Python; this is just running the LibreOffice executable from a Python script.
-
not2qubit about 2 years
What are you importing?