From MS Word or Libre Office to clean HTML

Solution 1

I was using http://word2cleanhtml.com/ until I realised that MS Word itself offers the option to save a document as HTML.

When you select this option, the .docx file is saved as .html, and the result is the best HTML version of a Word document that I've seen. It's certainly better than the output of the online tools I tried.

Solution 2

I realize this question is old, but the other answers never really answered it. If you are not averse to writing some PHP code, the CubicleSoft Ultimate Web Scraper Toolkit has a class called TagFilter:

https://github.com/cubiclesoft/ultimate-web-scraper/blob/master/support/tag_filter.php

You pass in two things: An array of options and the data to parse as HTML.

For cleaning up broken HTML, the default options from TagFilter::GetHTMLOptions() will act as a good starting point. Those options form the basis of valid HTML content and, doing nothing else, will clean up any input data into something that another tool like Simple HTML DOM can correctly parse in a DOM model.

However, the other way to use the class is to modify the default options and add a 'callback' option to the options array. For every tag in the HTML, the specified callback function will be called. The callback is expected to return what to do with each tag, which is where the real power of TagFilter comes into play. You can keep any given tag and some or all of its attributes (or modify them), get rid of the tag but keep the interior content, keep the tag but get rid of the content, modify the content (for closing tags), or get rid of both the tag and the interior content. This approach allows extremely refined control over the most convoluted HTML out there and processes the input in a single pass. See the same repository's test suite for example usage of TagFilter.
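The per-tag callback idea can be illustrated outside PHP as well. The following is a minimal Python sketch built on the standard library's html.parser; it is not TagFilter and does not use its API. The names KEEP, UNWRAP, DROP, basic_formatting_only, CallbackFilter and clean are hypothetical, and the tag lists are assumptions chosen to match "headings, lists and emphasis, no images". It covers only three of the choices described above (keep the tag, drop the tag but keep the content, drop both) and does not repair broken nesting.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical decision codes a callback may return for each tag.
KEEP, UNWRAP, DROP = "keep", "unwrap", "drop"

# Assumed tag lists for "basic formatting only"; adjust to taste.
BASIC_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p",
              "ul", "ol", "li", "em", "strong", "b", "i"}
DROP_TAGS = {"head", "style", "script"}
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


def basic_formatting_only(tag, attrs):
    """Example callback: decide what happens to one tag."""
    if tag in DROP_TAGS:
        return DROP      # remove the tag and everything inside it
    if tag in BASIC_TAGS:
        return KEEP      # keep the tag, without its attributes
    return UNWRAP        # remove the tag, keep the interior content


class CallbackFilter(HTMLParser):
    """Single-pass filter that asks a callback about every tag."""

    def __init__(self, callback):
        super().__init__(convert_charrefs=True)
        self.callback = callback
        self.out = []
        self.drop_tag = None   # name of the element currently being dropped
        self.drop_depth = 0    # nesting depth of that same element name

    def handle_starttag(self, tag, attrs):
        if self.drop_tag:
            if tag == self.drop_tag:
                self.drop_depth += 1
            return
        if tag in VOID_TAGS:
            return  # this sketch discards images, line breaks, etc.
        action = self.callback(tag, attrs)
        if action == DROP:
            self.drop_tag, self.drop_depth = tag, 1
        elif action == KEEP:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.drop_tag:
            if tag == self.drop_tag:
                self.drop_depth -= 1
                if self.drop_depth == 0:
                    self.drop_tag = None
            return
        if tag in VOID_TAGS:
            return
        # Ask the callback again so opening and closing tags agree.
        if self.callback(tag, []) == KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.drop_tag:
            # Entities were decoded by the parser; re-escape only &, < and >,
            # so non-Latin letters stay as plain characters.
            self.out.append(escape(data, quote=False))


def clean(html_text, callback=basic_formatting_only):
    parser = CallbackFilter(callback)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(clean('<h1 class="x"><font face="a"><span>Title</span><span>!</span></font></h1>'))
# prints: <h1>Title!</h1>
```

As with the approach described above, the decision logic lives in one small function, so changing what survives (for example, keeping tables) means editing the callback rather than the parser.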

The only downside is that the callback has to keep track of where it is between calls, whereas something like Simple HTML DOM selects things based on a DOM-like model. But that's only a drawback if the document being processed has things like 'id's and 'class'es. Most Word/Libre HTML content does not, which means it is a giant blob of unrecognizable/unparseable HTML as far as DOM processing tools go.

Author: Erel Segal-Halevi

Updated on June 05, 2022

Comments

  • Erel Segal-Halevi almost 2 years

    People that send content to my website use Word, so I get a lot of Word documents to convert to HTML. I want to conserve only the basic formatting - headings, lists and emphasis - no images.

    When I convert them with Libre Office "Save as HTML", the resulting files are huge, for example, a doc file of 112K becomes 450K HTML, most of it useless FONT and SPAN tags (for some reason, every single punctuation mark is enclosed in its own span!).

    I tried this script: http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708 based on tidy and sed, and it reduced the size to about 150K, but there are still many useless SPANs.

    I tried to copy and paste into Kompozer, an HTML editor, and then save as HTML; but it converted all my non-Latin (Hebrew) letters to entities such as "&#1456;", which increased the size to 750K!

    I tried docvert: https://github.com/holloway/docvert/issues/6 but found out that it requires a Python library that requires other libraries, etc., which seems like an endless chain of dependencies...

    Is there a simple way to create clean HTML from Office documents?

  • Erel Segal-Halevi over 11 years
    Using Notepad++ could be a solution for a single document. However, since I have new documents coming in each week, I don't want to repeat the same replacements again and again for each document...
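For a weekly batch, the manual "Save as HTML" step can be scripted. The following is a rough Python sketch, assuming LibreOffice is installed and its `soffice` binary is on the PATH (on some systems the binary is named `libreoffice` instead). The function names and folder names are hypothetical. The output is the same verbose HTML that LibreOffice produces interactively, so it would still need a cleaning pass afterwards.

```python
import subprocess
from pathlib import Path

# File types treated as office documents in this sketch (an assumption).
WORD_SUFFIXES = {".doc", ".docx", ".odt"}


def conversion_command(source, out_dir, soffice="soffice"):
    """Build the headless LibreOffice command line for one document."""
    return [soffice, "--headless", "--convert-to", "html",
            "--outdir", str(out_dir), str(source)]


def convert_folder(in_dir, out_dir, soffice="soffice"):
    """Convert every office document in in_dir; return the expected HTML paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for source in sorted(Path(in_dir).iterdir()):
        if source.suffix.lower() in WORD_SUFFIXES:
            # check=True raises CalledProcessError if LibreOffice reports failure.
            subprocess.run(conversion_command(source, out_dir, soffice), check=True)
            converted.append(out_dir / (source.stem + ".html"))
    return converted
```

Calling `convert_folder("incoming", "html_out")` once a week (both folder names are placeholders) would replace the repeated manual conversion; the per-document replacements could then be applied by a script to each returned path.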