Easiest way or best tools to convert Word text to clean (X)HTML

Solution 1

I am surprised no one has mentioned it, but HTML Tidy normally does a good job of this. I haven't used it recently, but I understand it is particularly well suited to cleaning up HTML exported from Word.
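
As a rough illustration of the Tidy approach, here is a minimal Python sketch that shells out to the `tidy` command-line tool. It assumes `tidy` is installed and on the PATH; the function names and file names are hypothetical, and the option set (`-asxhtml`, `--word-2000`, `--clean`) is just one plausible starting point, not a recommended configuration.

```python
import shutil
import subprocess
from pathlib import Path


def tidy_command(source: Path, target: Path) -> list[str]:
    """Build an HTML Tidy command line for HTML saved from Word."""
    return [
        "tidy",
        "-quiet",
        "-asxhtml",            # write the result as XHTML
        "-utf8",               # treat input and output as UTF-8
        "--word-2000", "yes",  # ask Tidy to strip Word-specific markup
        "--clean", "yes",      # replace presentational tags with style rules
        "-output", str(target),
        str(source),
    ]


def clean_word_html(source: Path, target: Path) -> bool:
    """Run Tidy if it is available; return True when it did not report errors."""
    if shutil.which("tidy") is None:
        return False  # Tidy is not installed, nothing was done
    result = subprocess.run(tidy_command(source, target), capture_output=True, text=True)
    # Tidy exits with 1 for warnings and 2 for errors, so only 2 counts as failure.
    return result.returncode < 2
```

Calling `clean_word_html(Path("from-word.html"), Path("clean.xhtml"))` would then write the cleaned file next to the original, leaving the Word export untouched.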

Solution 2

A long time ago I was tasked with taking a reasonably well-structured, multi-megabyte Word document and converting it into a series of HTML pages (about 20,000 of them!). This was accomplished by saving the Word doc as RTF (Word's Save As HTML output was much too "dirty") and converting the RTF to HTML via a Perl script. The conversion was a two-pass process: first clean up common formatting errors, then convert the cleaned RTF to HTML.

Since the document editors continued to maintain the Word document, it paid to codify common formatting errors in the first pass, because the errors often recurred even after being fixed.
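
The original script was written in Perl and is not shown here, so the following is only a toy Python sketch of the two-pass idea. The cleanup rules and the tiny RTF subset (`\b`, `\i`, `\par`) are invented for illustration; a real converter would also have to handle groups, fonts, tables, escapes and the RTF header.

```python
import html
import re

# Pass 1: a table of recurring formatting mistakes (hypothetical examples).
CLEANUP_RULES = [
    (re.compile(r"\\b\s*\\b0\s?"), ""),             # bold switched on and off around nothing
    (re.compile(r"(?:\\par\s*){2,}"), r"\\par "),   # runs of empty paragraphs
]

# Pass 2: a control word with an optional numeric argument, or a run of plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([^\\]+)")
TAGS = {"b": "<strong>", "b0": "</strong>", "i": "<em>", "i0": "</em>"}


def clean_rtf(rtf: str) -> str:
    """First pass: apply every codified fix-up rule in order."""
    for pattern, replacement in CLEANUP_RULES:
        rtf = pattern.sub(replacement, rtf)
    return rtf


def rtf_to_html(rtf: str) -> str:
    """Second pass: convert the cleaned toy RTF to HTML paragraphs."""
    paragraphs, current = [], []
    for match in TOKEN.finditer(rtf):
        word, number, text = match.groups()
        if text is not None:
            current.append(html.escape(text))
        elif word == "par":
            paragraphs.append("".join(current).strip())
            current = []
        else:
            # Unknown control words are silently dropped in this sketch.
            current.append(TAGS.get(word + (number or ""), ""))
    paragraphs.append("".join(current).strip())
    return "\n".join(f"<p>{p}</p>" for p in paragraphs if p)
```

Keeping the fix-ups in a data table is what makes the first pass cheap to extend: when the editors reintroduce a known mistake, or invent a new one, only `CLEANUP_RULES` needs to change.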

Incidentally, this process showed a very skeptical management how, in just 40 hours (or so), a good coder could produce ~20,000 web pages and keep them up to date indefinitely, while the original authors (whose time was even more valuable) would have spent multiple hundreds of hours doing the conversion and would have been forced to maintain the resulting HTML by hand thereafter.

Solution 3

I use TinyMCE to strip down and convert single Word documents. It is free, provided you can upload it to your web host (assuming you have one). I protect my installation to avoid spam, but you can use their demo at http://tinymce.moxiecode.com/tryit/full.php.

It actually does the job better than most stand-alone conversion programs that I have tried, at least for how I use it.

Solution 4

The easiest and fastest way for me is to copy all the text from Word and paste it into the WYSIWYG editor of Dreamweaver (any version from MX to CS3), using the Paste Special command and choosing to keep just the structure of the document. It works great if your Word document is not too complex; if it is really complex, you just need some extra editing in the code view. The resulting HTML is really clean.

The only problem with this method is that you need Dreamweaver, which is not free. Still, you can test the method with the trial version of DW.

Solution 5

Necromancing:

Open the Word document in Word 2013.
Save it as ODT (OpenDocument Text).
Open the ODT file with OpenOffice.
Then either use "Save As" ==> HTML Document,
or use "File" ==> Export ==> XHTML.

Export requires the JRE to be installed; Save As does not.
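
If the manual steps above need to be repeated often, the final conversion step can probably be scripted. The sketch below is an assumption on my part, not part of the original answer: it uses the `soffice` binary that ships with LibreOffice (a sibling of OpenOffice) in headless mode, and the function and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def convert_command(document: Path, out_dir: Path, binary: str = "soffice") -> list[str]:
    """Build a headless conversion command (assumes LibreOffice's soffice CLI)."""
    return [
        binary,
        "--headless",              # run without opening a window
        "--convert-to", "html",    # target format, chosen by file extension
        "--outdir", str(out_dir),
        str(document),
    ]


def convert_to_html(document: Path, out_dir: Path) -> Optional[Path]:
    """Convert one document; return the HTML path, or None if nothing was written."""
    if shutil.which("soffice") is None:
        return None  # the office suite is not installed or not on the PATH
    subprocess.run(convert_command(document, out_dir), capture_output=True, text=True)
    result = out_dir / (document.stem + ".html")
    return result if result.exists() else None
```

The output is whatever the office suite's HTML filter produces, so it may still need a cleanup pass afterwards.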

To automate Word itself, you can use either COM interop or Aspose.Words.

If you use Aspose.Words directly, you can just remove the "copyright" text with an XPath query ;)

Author by

Boris Smirnov

Drupal Developer, Designer , Web developer, UI/UX ninja

Updated on April 14, 2022

Comments

  • Boris Smirnov about 2 years

    This might have been asked in another way. I am not doing it on the fly, however. Once in a while we get pieces of content in Word files that have em dashes, bold and italic text, and block quotes. Is there a good tool to convert this into clean HTML code?

    Otherwise, what other approaches do people take?

  • Rabeel over 15 years
    Also, you should be able to automate the "paste into Notepad" part by calling GetText on the Clipboard object with the appropriate type.
  • David Burrows about 14 years
    Tried it on the current Word version and didn't get a good result at all; it may handle older versions' HTML output better.