Convert PDF to DOC (Python/Bash)
Solution 1
This is difficult because PDFs are presentation-oriented and Word documents are content-oriented. I have tested both and can recommend the following projects.
However, you are most definitely going to lose presentational aspects in the conversion.
Solution 2
If you want to convert a PDF to an MS Word file such as .docx, I came across this.
Ahsin Shabbir wrote:
import glob
import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = 0

pdfs_path = ""  # folder where the .pdf files are stored
reqs_path = ""  # output folder; undefined in the original post, assumed here

for doc in glob.iglob(os.path.join(pdfs_path, "*.pdf")):
    print(doc)
    filename = os.path.basename(doc)
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    # strip the ".pdf" suffix and save next to the other outputs as .docx
    out_file = os.path.abspath(os.path.join(reqs_path, filename[0:-4] + ".docx"))
    print("outfile\n", out_file)
    wb.SaveAs2(out_file, FileFormat=16)  # file format for docx
    print("success...")
    wb.Close()

word.Quit()
This worked like a charm for me; it converted a 500-page PDF with formatting and images.
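The script above builds the output name by splitting on backslashes and slicing off the last four characters. A minimal sketch of the same path handling with pathlib follows. The names docx_path_for and convert_folder and their folder arguments are hypothetical; the Word calls are the ones already used in the quoted script, and Windows with Word and pywin32 installed is assumed.

```python
from pathlib import Path


def docx_path_for(pdf_path, out_dir):
    """Return the .docx path inside out_dir that matches pdf_path's file name."""
    return Path(out_dir) / (Path(pdf_path).stem + ".docx")


def convert_folder(pdf_dir, out_dir):
    """Convert every .pdf in pdf_dir to .docx through Word (Windows only)."""
    # Imported here so the path helper stays usable without pywin32.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            doc = word.Documents.Open(str(pdf.resolve()))
            # 16 is the docx format code used in the script above.
            doc.SaveAs2(str(docx_path_for(pdf, out_dir).resolve()), FileFormat=16)
            doc.Close()
    finally:
        # Quit Word even if one of the conversions raised.
        word.Quit()
```

Using Path.stem rather than slicing keeps names such as a.b.pdf intact and does not depend on the path separator.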
Solution 3
You can use the GroupDocs.Conversion Cloud SDK for Python without installing any third-party tool or software.
Sample Python code:
# Import module
import groupdocs_conversion_cloud
# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)
try:
    # Upload source file to storage
    filename = 'Sample.pdf'
    remote_name = 'Sample.pdf'
    output_name = 'sample.docx'
    strformat = 'docx'
    request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
    response_upload = file_api.upload_file(request_upload)

    # Convert PDF to Word document
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = remote_name
    settings.format = strformat
    settings.output_path = output_name

    loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
    loadOptions.hide_pdf_annotations = True
    loadOptions.remove_embedded_files = False
    loadOptions.flatten_all_fields = True
    settings.load_options = loadOptions

    # Convert only the first page
    convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
    convertOptions.from_page = 1
    convertOptions.pages_count = 1
    settings.convert_options = convertOptions

    request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
    response = convert_api.convert_document(request)
    print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling convert_document: {0}".format(e.message))
I'm a developer evangelist at Aspose.
Solution 4
Based on the previous answers, this was the solution that worked best for me using Python 3.7.1:
import win32com.client
import os
# INPUT/OUTPUT PATH
pdf_path = r"""C:\path2pdf.pdf"""
output_path = r"""C:\output_folder"""
word = win32com.client.Dispatch("Word.Application")
word.visible = 0 # CHANGE TO 1 IF YOU WANT TO SEE WORD APPLICATION RUNNING AND ALL MESSAGES OR WARNINGS SHOWN BY WORD
# GET FILE NAME AND NORMALIZED PATH
filename = pdf_path.split('\\')[-1]
in_file = os.path.abspath(pdf_path)
# CONVERT PDF TO DOCX AND SAVE IT ON THE OUTPUT PATH WITH THE SAME INPUT FILE NAME
wb = word.Documents.Open(in_file)
out_file = os.path.abspath(output_path + '\\' + filename[0:-4] + ".docx")
wb.SaveAs2(out_file, FileFormat=16)
wb.Close()
word.Quit()
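Both Word-based scripts hard-code FileFormat=16 and reach Close() only if nothing fails first. The sketch below is a variation, not the poster's code: it picks the format code from the output extension and closes the document in a finally block. save_pdf_as_word and WD_FORMATS are hypothetical names; the numbers are the WdSaveFormat values wdFormatDocument (0, .doc) and wdFormatDocumentDefault (16, .docx). The Word application object is passed in as an argument, so the function itself does not import pywin32.

```python
import os

# WdSaveFormat codes: 0 = wdFormatDocument (.doc), 16 = wdFormatDocumentDefault (.docx)
WD_FORMATS = {".doc": 0, ".docx": 16}


def save_pdf_as_word(word, pdf_path, out_path):
    """Open pdf_path in the given Word application and save it as out_path."""
    ext = os.path.splitext(out_path)[1].lower()
    if ext not in WD_FORMATS:
        raise ValueError("unsupported output extension: " + ext)
    doc = word.Documents.Open(os.path.abspath(pdf_path))
    try:
        doc.SaveAs2(os.path.abspath(out_path), FileFormat=WD_FORMATS[ext])
    finally:
        # Close the document even if saving failed.
        doc.Close()
    return os.path.abspath(out_path)
```

On Windows this would be called with word = win32com.client.Dispatch("Word.Application"), with word.Quit() placed in a finally block of its own so that no hidden Word process is left running.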
AlvaroAV
Updated on October 14, 2021
AlvaroAV over 2 years
I've seen some pages that allow the user to upload a PDF and return a DOC file, like PdfToWord. Is there any way to convert a PDF file to a DOC/DOCX file using Python or any Unix command? Thanks in advance.