Convert PDF to .docx with Python

python pdf docx libreoffice doc

20,526

Solution 1

I am not aware of a way to convert a PDF file directly into a Word file using LibreOffice.
However, you can convert the PDF to HTML and then convert the HTML to docx.
First, get the commands running on the command line. (The following is on Linux, so on your OS you may have to fill in the path to the soffice binary and use a full path for the input file.)

soffice --convert-to html ./my_pdf_file.pdf

then

soffice --convert-to docx:'MS Word 2007 XML' ./my_pdf_file.html

You should end up with:

my_pdf_file.pdf
my_pdf_file.html
my_pdf_file.docx

Now wrap the commands in your subprocess code.
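A minimal sketch of that wrapping step, assuming `soffice` is on your PATH. The function names `build_commands` and `pdf_to_docx` are made up for illustration, and the `--headless` and `--outdir` options are additions that are not part of the commands above (they keep the GUI from opening and write the output next to the input file). The arguments are passed as a list, so no shell quoting is needed around the filter name.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, soffice="soffice"):
    """Return the two soffice command lines: PDF -> HTML, then HTML -> DOCX.

    `soffice` is the name or full path of the LibreOffice binary (an
    assumption; adjust it for your OS).
    """
    pdf = Path(pdf_path)
    out_dir = str(pdf.parent)          # write results next to the input file
    html = pdf.with_suffix(".html")    # intermediate file from step 1
    return [
        [soffice, "--headless", "--convert-to", "html",
         "--outdir", out_dir, str(pdf)],
        [soffice, "--headless", "--convert-to", "docx:MS Word 2007 XML",
         "--outdir", out_dir, str(html)],
    ]


def pdf_to_docx(pdf_path, soffice="soffice"):
    """Run both conversion steps and return the expected .docx path."""
    for cmd in build_commands(pdf_path, soffice):
        # check=True raises CalledProcessError if soffice exits with an error
        subprocess.run(cmd, check=True)
    return Path(pdf_path).with_suffix(".docx")
```

Calling `pdf_to_docx("./my_pdf_file.pdf")` should then leave `my_pdf_file.html` and `my_pdf_file.docx` next to the PDF, as in the listing above. To process a whole folder, call it once per PDF found by `os.walk`.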

Solution 2

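I use this for multiple files:

from pdf2docx import Converter
import os

# Input and output directories; every PDF in path_input is converted.
path_input = '/pdftodocx/input/'
path_output = '/pdftodocx/output/'

for file in os.listdir(path_input):
    # Skip anything that is not a PDF
    if not file.lower().endswith('.pdf'):
        continue
    # Name the output after the input, replacing .pdf with .docx
    docx_name = os.path.splitext(file)[0] + '.docx'
    cv = Converter(os.path.join(path_input, file))
    cv.convert(os.path.join(path_output, docx_name), start=0, end=None)
    cv.close()
    print(file)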

Solution 3

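The code below worked for me (Windows, with Microsoft Word installed).

import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = 1
# Word resolves relative paths against its own working directory, so pass absolute paths
pdfdoc = os.path.abspath('NewDoc.pdf')
todocx = os.path.abspath('NewDoc.docx')
wb1 = word.Documents.Open(pdfdoc)
wb1.SaveAs(todocx, FileFormat=16)  # 16 = wdFormatDocumentDefault (.docx)
wb1.Close()
word.Quit()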

Solution 4

My approach does not call an external program like the other solutions do. Instead, it reads through all the pages of a PDF document and copies their text into a docx file. Note: it only works with text; images and other objects are usually ignored.

# Description: this Python script extracts the text from a PDF file and writes it to a .docx file.
# Written against the PyPDF2 1.x API; PyPDF2 3.x renames these calls to
# PdfReader, len(reader.pages), reader.pages[i] and extract_text().

import PyPDF2
import docx

mydoc = docx.Document()  # new, empty Word document

with open('pdf/filename.pdf', 'rb') as pdfFileObj:  # PDF file location
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # PDF reader object

    # Loop through all the pages (page indices start at 0)
    for pageNum in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfContent = pageObj.extractText()  # extract the text of the page
        print(pdfContent)  # optional: show the output in the terminal
        mydoc.add_paragraph(pdfContent)  # add the text to the Word document

mydoc.save("pdf/filename.docx")  # name of the output file


Author by

Also

Updated on October 14, 2021

Comments

Also over 2 years
I'm trying very hard to find a way to convert a PDF file to a .docx file with Python.

I have seen other posts related to this, but none of them seem to work correctly in my case.

Specifically, I'm using:
```
import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)
```
This gives me Output[1], but then I can't find any .docx document in my folder.

I have LibreOffice 5.3 installed.

Any clues about it?

Thank you in advance!
PythonProgrammi over 2 years

It says that it's impossible to open the file.