How to convert a txt file or PDF to a Word doc using Python?

14,377

Solution 1

Using python-docx I was able to pretty easily convert the txt files to Word docs.

Here's what I did.

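from docx import Document
import re
import os

# Folder that holds the .txt files to convert
path = '/users/tdobbins/downloads/smithtxt'
direct = os.listdir(path)

for i in direct:
    document = Document()
    document.add_heading(i, 0)  # level 0 = document title; uses the file name
    # Read from the same folder that was listed above
    with open(os.path.join(path, i)) as f:
        myfile = f.read()
    # Replace runs of non-ASCII characters and form feeds (\x0c) with a space
    myfile = re.sub(r'[^\x00-\x7F]+|\x0c', ' ', myfile)
    p = document.add_paragraph(myfile)
    document.save('/path/to/write/to/' + i + '.docx')  # e.g. notes.txt -> notes.txt.docx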
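One thing to watch with the regex above: it replaces every non-ASCII character, so accented letters and curly quotes are lost too. If you want to keep those, a gentler option is to strip only the characters that XML 1.0 forbids (NUL, form feed, and the other control characters). Here is a minimal sketch using only the standard library. The helper name clean_for_docx is made up for illustration, and the character ranges are taken from the XML 1.0 definition of a valid character. I am assuming python-docx accepts everything that remains, so test it on your own files.

```python
import re

# Characters that are NOT allowed in XML 1.0: everything outside
# tab, newline, carriage return, and the ranges listed in the spec.
_INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def clean_for_docx(text):
    """Replace XML-invalid characters with a space, keeping non-ASCII text.

    Hypothetical helper, not part of python-docx.
    """
    return _INVALID_XML_CHARS.sub(' ', text)

if __name__ == '__main__':
    sample = 'caf\u00e9\x0cnext page\x00'
    print(repr(clean_for_docx(sample)))
```

You could then call clean_for_docx(myfile) in place of the re.sub line in the loop.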

Solution 2

You could check out python-docx. It creates Word documents from Python, so you could write the contents of each text file into a .docx file. See the "What it can do" page in the python-docx documentation.
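As a rough sketch of what that could look like: split the text on blank lines and add each chunk as its own Word paragraph, instead of one giant paragraph. The function names split_paragraphs and txt_to_docx are made up for this example, and I am assuming the text files are UTF-8 and that a blank line marks a paragraph break. Document, add_heading, add_paragraph and save are the python-docx calls.

```python
import re

def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines.

    Line wraps inside a paragraph are collapsed to single spaces.
    """
    blocks = re.split(r'\n\s*\n', text)
    return [' '.join(block.split()) for block in blocks if block.strip()]

def txt_to_docx(txt_path, docx_path, title=None):
    """Write the paragraphs of a text file into a new .docx file."""
    # Imported here so split_paragraphs can be used without python-docx installed
    from docx import Document

    # Assumption: the input is UTF-8; change the encoding if yours differs
    with open(txt_path, encoding='utf-8') as f:
        text = f.read()

    document = Document()
    if title:
        document.add_heading(title, 0)  # level 0 = document title
    for paragraph in split_paragraphs(text):
        document.add_paragraph(paragraph)
    document.save(docx_path)
```

Usage would be something like txt_to_docx('report.txt', 'report.docx', title='report'), with your own file names. If the text still contains control characters such as form feeds left over from the PDF conversion, strip them first, since python-docx rejects strings that are not XML compatible.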

Author by

tmthyjames

Updated on June 04, 2022

Comments

  • tmthyjames
    tmthyjames almost 2 years

    Is there a way to convert PDFs (or text files) to Word docs in python? I'm doing some web-scraping for my professor and the original docs are PDFs. I converted all 1,611 of those to text files and now we need to convert them to Word docs. The only thing I could find was a Word-to-txt converter, not the reverse.

    Thanks!

  • tmthyjames
    tmthyjames about 9 years
    Thanks. I'm checking it out. Other than installing it being a pain, it looks like it'll work.
  • Anmol Monga
    Anmol Monga about 6 years
    I don't want to handle the formatting in my own code. Is there any way to take an input file and convert it to .doc/.docx directly?