.doc to pdf using python

134,874

Solution 1

A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:

import sys
import os
import comtypes.client

wdFormatPDF = 17

in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

You could also use pywin32, which would be the same except for:

import win32com.client

and then:

word = win32com.client.Dispatch('Word.Application')

Solution 2

You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.

from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf

Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf

Solution 3

I have worked on this problem for half a day, so I think I should share some of my experience on this matter. Steven's answer is right, but it will fail on my computer. There are two key points to fix it here:

(1). The first time when I created the 'Word.Application' object, I should make it (the word app) visible before open any documents. (Actually, even I myself cannot explain why this works. If I do not do this on my computer, the program will crash when I try to open a document in the invisible model, then the 'Word.Application' object will be deleted by OS. )

(2). After doing (1), the program will work well sometimes but may fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM Server may not be able to response so quickly. So I add a delay before I tried to open a document.

After doing these two steps, the program will work perfectly with no failure anymore. The demo code is as below. If you have encountered the same problems, try to follow these two steps. Hope it helps.

    import os
    import comtypes.client
    import time


    wdFormatPDF = 17


    # absolute path is needed
    # be careful about the slash '\', use '\\' or '/' or raw string r"..."
    in_file=r'absolute path of input docx file 1'
    out_file=r'absolute path of output pdf file 1'

    in_file2=r'absolute path of input docx file 2'
    out_file2=r'absolute path of outputpdf file 2'

    # print out filenames
    print in_file
    print out_file
    print in_file2
    print out_file2


    # create COM object
    word = comtypes.client.CreateObject('Word.Application')
    # key point 1: make word visible before open a new document
    word.Visible = True
    # key point 2: wait for the COM Server to prepare well.
    time.sleep(3)

    # convert docx file 1 to pdf file 1
    doc=word.Documents.Open(in_file) # open docx file 1
    doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 1
    word.Visible = False
    # convert docx file 2 to pdf file 2
    doc = word.Documents.Open(in_file2) # open docx file 2
    doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 2   
    word.Quit() # close Word Application 

Solution 4

I have tested many solutions but no one of them works efficiently on Linux distribution.

I recommend this solution :

import sys
import subprocess
import re


def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]

    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    filename = re.search('-> (.*?) using filter', process.stdout.decode())

    return filename.group(1)


def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'

and you call your function:

result = convert_to('TEMP Directory',  'Your File', timeout=15)

All resources:

https://michalzalecki.com/converting-docx-to-pdf-using-python/

Solution 5

As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.

doc.ExportAsFixedFormat(OutputFileName=pdf_file,
    ExportFormat=17, #17 = PDF output, 18=XPS output
    OpenAfterExport=False,
    OptimizeFor=0,  #0=Print (higher res), 1=Screen (lower res)
    CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
    DocStructureTags=True
    );

The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'

Share:
134,874
nik
Author by

nik

Updated on July 05, 2022

Comments

  • nik
    nik almost 2 years

    I'am tasked with converting tons of .doc files to .pdf. And the only way my supervisor wants me to do this is through MSWord 2010. I know I should be able to automate this with python COM automation. Only problem is I dont know how and where to start. I tried searching for some tutorials but was not able to find any (May be I might have, but I don't know what I'm looking for).

    Right now I'm reading through this. Dont know how useful this is going to be.

  • nik
    nik about 13 years
    I'am a linux/Unix user and more inclined towards python. But the ps script looks pretty simple and exactly what I was looking for. Thanks :)
  • nik
    nik almost 13 years
    This is exactly what I was looking for. Thanks :)
  • ecoe
    ecoe about 10 years
    For many files, consider setting: word.Visible = False to save time and processing of the word files (MS word will not display this way, code will run in background essentially)
  • Snorfalorpagus
    Snorfalorpagus about 9 years
    I've managed to get this working for powerpoint documents. Use Powerpoint.Application, Presentations.Open and FileFormat=32.
  • slaveCoder
    slaveCoder over 7 years
    I am using a linux server and these libraries dont work in linux.. is there any other way to make it work in linux
  • Aman Gautam
    Aman Gautam about 7 years
    8 years into development and it's this answer that made me feel bad that I am not on Windows!
  • user3732708
    user3732708 about 7 years
    when I run this, came an error File "test.py", line 7, in <module> in_file = os.path.abspath(sys.argv[1]) IndexError: list index out of range
  • Peter Wood
    Peter Wood almost 7 years
    @user3732708 argv[1] and argv[2] will be the names of the input and output files. You get that error if you don't specify the files on the command line.
  • asetniop
    asetniop almost 6 years
    When running the doc.SaveAs() command I got an error and had to drop the "FileFormat=" prefix, and then it worked fine.
  • Todd
    Todd over 4 years
    Just used your package to print my .docx file. It worked like a charm! Couldn't have been simpler to use. Great job!
  • Al Johri
    Al Johri over 4 years
    Thanks @Todd! Give the repo a star when you get a chance.
  • Al Johri
    Al Johri about 4 years
    @Abdelhedihlel Unfortunately, it requires Microsoft Office to be installed and thus only works on Windows and macOS.
  • abdelhedi hlel
    abdelhedi hlel about 4 years
    @AlJohri take a look here michalzalecki.com/converting-docx-to-pdf-using-python this solution works on both windows and linux. runnig on linux it's a must bcause the most of deployement servers use linux
  • Basj
    Basj about 4 years
    Can you include a sample code to show how to do it from a python script (import unoconv unoconv.dosomething(...))? The documentation only shows how to do it from command line.
  • Basj
    Basj about 4 years
    Do you have an example working with LibreOffice? word = comtypes.client.CreateObject('LibreWriter.Application') doesn't work.
  • Vishesh Mangla
    Vishesh Mangla almost 4 years
    is there any way to just use file objects and avoid these file saves?
  • Vishesh Mangla
    Vishesh Mangla almost 4 years
    is there a method to convert word file object to pdf in your module?
  • rain
    rain about 3 years
    Will it help to preserve bookmarks that were created on the docx file?
  • Att Righ
    Att Righ about 2 years
    "Please note that there is a rewrite of Unoconv called "Unoserver": github.com/unoconv/unoserver We are running Unoserver successfully in production, and it’s now the recommended solution. Unoserver does not have all the features of Unoconv, which features it will get depends on a combination of what people want, and if someone wants to implement it. Until Unoserver has all the major features people need, Unoconv is in bugfix mode, there will be no major changes...." from github.com/unoconv/unoconv I'm think I'm placing my money on unoconv still.
  • Att Righ
    Att Righ about 2 years
    Heads up for other uses, I had some issue making unoconv work. The approach I went for (which works okay on linux and within docker) was called libreoffice directly as described in this answer.
  • not2qubit
    not2qubit about 2 years
    This is not using Python, this is just running the libre office exe from a python script.
  • not2qubit
    not2qubit about 2 years
    What are you importing?