HTML/PDF to DOC(X) in Linux command line?
Solution 1
I've just stumbled on this question and after a bit more googling, found pandoc: http://johnmacfarlane.net/pandoc/README.html
A simple command will create a docx or pdf (or rtf etc) file from html input like so:
pandoc -o output.docx input.html
It can also write to stdout (with some formats) and read from stdin.
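To illustrate the stdin/stdout point, here is a minimal sketch of pandoc used as a filter. The function name html_to_rtf and the file names are placeholders of mine, not part of the answer above; -f, -t and -s are the pandoc options for input format, output format and standalone output. RTF is used because it is a text format; pandoc may refuse to send binary formats such as docx to a terminal.

```shell
#!/bin/sh
# Read HTML on stdin and write RTF on stdout.
# -f / -t name the input and output formats explicitly, since there is
# no file extension for pandoc to guess from; -s emits a complete document.
html_to_rtf() {
    pandoc -f html -t rtf -s
}

# Usage (file names are placeholders):
#   html_to_rtf < input.html > output.rtf
```

Wrapping the call in a function just makes it easy to drop into a longer pipeline, for example after curl or another tool that prints HTML.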
Not sure if it's in the Debian / Ubuntu repos, but it's in the EPEL 6 repo for Red Hat / CentOS 6 (yum install pandoc).
Hope this helps someone :)
Solution 2
You can convert HTML into .doc using an OpenOffice macro, see this thread:
http://www.oooforum.org/forum/viewtopic.phtml?p=44367#44367
Converting PDF to .doc is much harder because of the variety of content that can be inside a PDF; quite often PDFs are used for things such as scanned text.
Solution 3
You can use pdftohtml to make an HTML file from a PDF.
Word can open HTML files directly.
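If you want a DOCX file rather than HTML that Word opens, one possible sketch is to pipe pdftohtml into pandoc. Chaining the two tools is my assumption, not something this answer proposes; the function name pdf_to_docx and the file names are placeholders. The flags are the ones documented for the poppler build of pdftohtml, so check pdftohtml -h on your system first.

```shell
#!/bin/sh
# Convert one PDF to DOCX by way of HTML.
# $1 = input PDF, $2 = output DOCX (both chosen by the caller)
pdf_to_docx() {
    # -stdout   write the HTML to standard output instead of a file
    # -noframes produce a single HTML document, no frameset
    # -i        ignore images, which would otherwise be written as separate files
    pdftohtml -stdout -noframes -i "$1" | pandoc -f html -o "$2"
}

# Usage (file names are placeholders):
#   pdf_to_docx input.pdf output.docx
```

Images are dropped in this sketch, and scanned PDFs will come out empty because they contain no extractable text.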
Solution 4
You might be able to do the latter using OpenOffice from the command line. There are also bridges for scripting languages; find out more on OpenOffice's website. There is one for PHP called PUNO, but I have no personal experience with it yet.
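As a rough sketch of the command-line route, the following assumes a LibreOffice-style soffice binary on the PATH; older OpenOffice builds may not accept these options. The function name office_html_to_docx is a placeholder of mine, and the explicit filter name "MS Word 2007 XML" is an assumption about what your installation calls its DOCX export filter.

```shell
#!/bin/sh
# Convert an HTML file to DOCX with a headless office instance.
# $1 = input HTML file, $2 = output directory (both chosen by the caller)
office_html_to_docx() {
    # --headless    run without opening a window
    # --convert-to  target extension, optionally followed by ":filter name"
    # --outdir      directory that receives the converted file
    soffice --headless --convert-to 'docx:MS Word 2007 XML' --outdir "$2" "$1"
}

# Usage (paths are placeholders):
#   office_html_to_docx page.html converted/
```

The output file keeps the input's base name, so page.html would become converted/page.docx if the conversion succeeds.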
studiohack
Updated on September 17, 2022
Comments
-
studiohack over 1 year
I need to convert PDF or HTML+CSS into DOC or DOCX under Linux. It can be done from the command line or with a scripting language.
Any ideas?
-
saunderl over 14 years
None of these work for me.
-
Sevki over 14 years
It parses HTML very badly, ignoring most of its CSS.