Easiest way or best tools to convert Word text to clean (X)HTML
Solution 1
I am surprised no one has mentioned it, but HTML Tidy normally does a good job of this. I haven't used it recently, but I understand it is particularly suited to cleaning up HTML content exported from Word.
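As a rough illustration, Tidy can be driven from a script. The sketch below assumes the `tidy` command-line tool is installed and on the PATH, and the helper names `build_tidy_command` and `tidy_word_html` are hypothetical. The options used (`-asxhtml`, `--word-2000`, `--clean`) are standard Tidy settings, but how well they cope with the HTML saved by a given Word version is something to verify on your own files.

```python
import shutil
import subprocess


def build_tidy_command(executable="tidy"):
    """Return the argument list for a Tidy run aimed at Word-exported HTML."""
    return [
        executable,
        "-quiet",                 # suppress the summary banner
        "-utf8",                  # read and write UTF-8
        "-asxhtml",               # emit well-formed XHTML
        "--word-2000", "yes",     # strip the extra markup Word adds on "Save As HTML"
        "--clean", "yes",         # replace presentational tags with style rules
        "--show-warnings", "no",  # keep stderr short
    ]


def tidy_word_html(html_text, executable="tidy"):
    """Pipe html_text through Tidy and return the cleaned markup."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on PATH")
    result = subprocess.run(
        build_tidy_command(executable),
        input=html_text,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # Tidy exits with 0 on success, 1 on warnings and 2 on errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```

Calling `tidy_word_html(open("export.htm", encoding="utf-8").read())` would then return the cleaned XHTML as a string, where `export.htm` stands in for whatever file Word produced.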
Solution 2
A long time ago I was tasked with taking a reasonably well-structured, multi-megabyte Word document and converting it into a series of HTML pages (about 20,000 of them!). This was accomplished by saving the Word doc as RTF (Word's Save As HTML output was much too "dirty") and converting the RTF to HTML via a Perl script. The conversion was a two-pass process: first clean up common formatting errors, then convert the cleaned RTF to HTML.
Since the document editors continued to maintain the Word document, it paid to codify common formatting errors in the first pass, because the errors often recurred even after being fixed.
Incidentally, this process showed a very skeptical management how in just 40 hours (or so) a good coder could produce ~20,000 web pages and keep them up to date indefinitely, while the original authors (whose time was even more valuable) would have spent multiple hundreds of hours doing the conversion and would have been forced to maintain the resulting HTML by hand thereafter.
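The shape of that two-pass approach can be sketched in a few lines. This is not the original Perl script and it does not parse RTF: it is a Python illustration that works on plain-text paragraphs, and the rules in `CLEANUP_RULES` are hypothetical examples of the kind of recurring slips one might codify.

```python
import html
import re

# Pass 1 rules: each recurring formatting slip gets one (pattern, replacement)
# entry, so a fix is applied again every time the source document is re-exported.
CLEANUP_RULES = [
    (re.compile(r"[ \t]+"), " "),         # collapse runs of spaces and tabs
    (re.compile(r" +([,.;:])"), r"\1"),   # drop a stray space before punctuation
    (re.compile(r"\s*--\s*"), "\u2014"),  # typed double hyphen becomes an em dash
]


def clean_paragraph(text):
    """Pass 1: apply every cleanup rule in order, then trim the ends."""
    for pattern, replacement in CLEANUP_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def to_html(paragraphs):
    """Pass 2: wrap each non-empty paragraph in <p>, escaping &, < and >."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs if p)


def convert(raw_text):
    """Run both passes over text whose paragraphs are separated by blank lines."""
    paragraphs = [clean_paragraph(p) for p in raw_text.split("\n\n")]
    return to_html(paragraphs)
```

Keeping the rules in a single table is the point of the design: when editors reintroduce a known error, the next run fixes it again without anyone touching the output by hand.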
Solution 3
I use TinyMCE to strip down and convert single Word documents. It is free, provided you can upload it to your web host (assuming you have one). I protect my installation to avoid spammage, but you can use their demo at http://tinymce.moxiecode.com/tryit/full.php.
It actually does the job better than most stand-alone conversion programs that I have tried, at least for how I use it.
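The kind of stripping described here can be approximated outside an editor. The sketch below is not TinyMCE's code or API; it is a standard-library Python illustration of the general idea, keeping a small whitelist of structural tags and discarding attributes, `<span>` wrappers, comments and style blocks. The `ALLOWED` set and the names `WhitelistCleaner` and `clean` are assumptions chosen for the example.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"p", "strong", "b", "em", "i", "blockquote", "br"}
SKIPPED = {"style", "script"}  # their text content is dropped entirely


class WhitelistCleaner(HTMLParser):
    """Re-emit only whitelisted tags, without attributes, plus escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("<br />")
        elif tag in ALLOWED:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in ALLOWED and tag != "br":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(escape(data, quote=False))


def clean(html_text):
    """Return html_text reduced to whitelisted tags and plain text."""
    parser = WhitelistCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

For example, `clean('<p class="MsoNormal"><span style="x"><b>Hi</b></span></p>')` yields `<p><b>Hi</b></p>`. A sketch like this does not repair unbalanced markup, so it suits input that is already reasonably well formed.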
Solution 4
The easiest and fastest way for me is to copy all the text from Word and paste it into the WYSIWYG editor of Dreamweaver (any version from MX to CS3), using the Paste Special command and choosing to keep just the structure of the document. It works great if your Word document is not too complex, and if it is really complex you just need some extra editing in the code view. The resulting HTML is really clean.
The only problem with this method is that you need Dreamweaver, which is not free. However, you can test the method with the trial version of DW.
Solution 5
Necromancing:
Open the Word document in Word 2013.
Save it as ODT (OpenDocument Text, the OpenOffice format).
Open it with OpenOffice.
Then either use "Save As" ==> HTML Document
or use
"File" ==> Export ==> XHTML
Export requires the JRE to be installed; "Save As" does not.
For Word, you can use either COM interop or Aspose.Words.
You can also use Aspose.Words directly and just remove the "copyright" text with an XPath query ;)
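The OpenOffice half of the steps above can also be scripted rather than clicked through. The sketch below swaps in LibreOffice's headless mode, which is an assumption beyond the answer: it presumes a `soffice` binary on the PATH that accepts `--headless --convert-to`, and the helper names are hypothetical. It produces plain HTML, corresponding to the "Save As" route rather than the XHTML export.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, out_dir, executable="soffice"):
    """Return the argument list for a headless ODT-to-HTML conversion."""
    return [
        executable,
        "--headless",               # run without opening a window
        "--convert-to", "html",     # target format
        "--outdir", str(out_dir),   # directory for the converted file
        str(source),
    ]


def convert_to_html(source, out_dir, executable="soffice"):
    """Convert one document and return the path where the HTML is expected."""
    source = Path(source)
    out_dir = Path(out_dir)
    subprocess.run(build_convert_command(source, out_dir, executable), check=True)
    # The converter names the output after the input file's stem.
    return out_dir / (source.stem + ".html")
```

A call such as `convert_to_html("report.odt", "out")` would be expected to leave `out/report.html` behind, where `report.odt` is a placeholder file name.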
Boris Smirnov
Drupal Developer, Designer , Web developer, UI/UX ninja
Updated on April 14, 2022
Comments
-
Boris Smirnov about 2 years
This might have been asked in another way. I am not doing it on the fly, however. Once in a while we get pieces of content in Word files that have em dashes, bold and italic text, and block quotes. Is there a good tool to convert this into clean HTML code?
Otherwise, what other approaches do people take?
-
Rabeel over 15 years
Also, you should be able to automate the "paste into Notepad" part by calling GetText on the Clipboard object with the appropriate type.
-
David Burrows about 14 years
Tried it on the current Word version and didn't get a good result at all; it may handle HTML output from older versions better.