PDF to HTML - batch converter - most reliable and accurate free AND paid for software?

Solution 1

My solution has two parts. The first is to continue to use the IntraPDF PDF to HTML program, which I paid for (http://www.intrapdf.com/convert_pdf_to_html.htm), on my XP platform. It doesn't seem to work on Windows 7 Home 32-bit, where it hangs.

But I agree with you, @geekosaur, that PDF and HTML have different goals, so the conversion won't be exact (perhaps even with CSS applied to the HTML). The resulting HTML I've seen on some pages has formatting that differs from the original, but it will do.

So the second part of the solution is to use the free tool IrfanView to convert from PDF to JPG, so that the PDF document becomes a series of JPG images, one for each page. This is easy to set up: IrfanView packages PDF conversion as part of its plug-in suite, and the prerequisite for PDF is downloading GhostView, which IrfanView provides a link to. This works very well, except that the UI sometimes hangs during the process, although the conversion still proceeds.

http://en.irfanview-forum.de/vb/showthread.php?7689-Irfanview-freezes-during-PDF-to-JPG-conversion-if-you-try-to-continue-with-other-prog
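
If the per-page JPGs also need to be viewable as a single HTML page, a small script can wrap them. This is only a minimal sketch, not something IrfanView produces: the function name `build_page_index` and the file naming are hypothetical, and it assumes the page images sit in one folder with zero-padded names (for example `doc_page_0001.jpg`) so that sorting by name gives page order.

```python
from html import escape
from pathlib import Path


def build_page_index(image_dir: Path, title: str) -> str:
    """Return an HTML document that shows every .jpg in image_dir, sorted by file name.

    Assumption: file names are zero-padded, so name order equals page order.
    """
    pages = sorted(p.name for p in image_dir.glob("*.jpg"))
    body = "\n".join(
        f'<p><img src="{escape(name, quote=True)}" alt="Page {i}"></p>'
        for i, name in enumerate(pages, start=1)
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical folder holding the JPG pages of one converted PDF.
    folder = Path("converted_pages")
    if folder.is_dir():
        (folder / "index.html").write_text(
            build_page_index(folder, "Converted document"), encoding="utf-8"
        )
```

The resulting `index.html` references the images by relative name, so the folder can be moved or archived as a unit.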

To clarify my goal: I wanted the PDF documents in a non-proprietary format, which would give me more options for viewing the docs in the future. PDF is fairly ubiquitous, but I like my data to be free, as in not tied to a format.

Thanks to the other contributors.

Solution 2

PDF is a lousy input format for conversion, so "flakey" is pretty much the rule. Some files can be converted relatively easily, but most will have problems. (Very briefly: a PDF file is a compressed list of "move here, output this, move there, ...". If the document contains anything other than simple L-to-R text, such as tables, images, RTL text or footnotes, the conversion will probably produce some amount of garbage.)
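
To make the "move here, output this" point concrete, here is a toy sketch. The `(x, y, text)` tuples are a hypothetical, heavily simplified stand-in for real PDF drawing operations, and the names `naive_text` and `reconstruct_lines` are made up for illustration. It shows that reading fragments in stream order can scramble a simple table, and that a converter has to guess the layout back from coordinates.

```python
from collections import defaultdict

# Hypothetical fragments in the order they appear in the file: (x, y, text).
# In PDF coordinates, larger y is higher up on the page.
OPS = [
    (300, 700, "Price"),
    (72, 700, "Item"),
    (72, 680, "Widget"),
    (300, 680, "4.50"),
    (72, 40, "1 See appendix."),  # footnote at the bottom of the page
]


def naive_text(ops):
    """Emit text in stream order, ignoring positions."""
    return " ".join(text for _, _, text in ops)


def reconstruct_lines(ops, y_tolerance=2):
    """Group fragments with similar y into lines, top to bottom, left to right."""
    lines = defaultdict(list)
    for x, y, text in ops:
        lines[round(y / y_tolerance)].append((x, text))
    return [
        " ".join(text for _, text in sorted(fragments))
        for _, fragments in sorted(lines.items(), reverse=True)
    ]


print(naive_text(OPS))         # Price Item Widget 4.50 1 See appendix.
print(reconstruct_lines(OPS))  # ['Item Price', 'Widget 4.50', '1 See appendix.']
```

Even after regrouping, nothing in the data says that the first two lines form a table or that the last one is a footnote. That structure has to be inferred, which is where converters tend to go wrong.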

Solution 3

There is an HTML/JavaScript-based PDF renderer called PDF.js that uses the Canvas element: http://mozilla.github.com/pdf.js/web/viewer.html

It's under development but it might do the job for some.

Solution 4

'Gemini' from Iceni batch converts PDF documents to HTML...

http://www.iceni.com/gemini-features.htm

The output isn't 100% perfect but you might find it acceptable. And it's a good base to work from. If you're a perfectionist then some post-production 'search & replace' can usually iron out most issues.

Solution 5

I'd check whether OpenOffice/LibreOffice have command-line flags for conversion.

PDFs suck for what you're trying to do. There is a huge document-model mismatch between how PDF sees a page and how HTML sees a page. There will be PDF files that just can't be converted easily to HTML by anything.
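
As a sketch of the command-line route suggested above: recent LibreOffice builds accept `--headless`, `--convert-to`, `--outdir` and `--infilter`, so a batch run could look roughly like the following. Treat it as an assumption to verify against `soffice --help` on your install. The helper names, the folder names, and the choice of the `writer_pdf_import` filter are illustrative, and the output quality is subject to the mismatch described above.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir, soffice="soffice"):
    """Build the argument list for one headless PDF-to-HTML conversion.

    Assumes the LibreOffice binary is reachable as `soffice`.
    """
    return [
        soffice,
        "--headless",
        "--infilter=writer_pdf_import",  # open the PDF in Writer rather than Draw
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(pdf_path),
    ]


def convert_all(pdf_dir, out_dir, soffice="soffice"):
    """Convert every .pdf in pdf_dir, one file at a time."""
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        subprocess.run(build_convert_command(pdf, out_dir, soffice), check=True)


if __name__ == "__main__":
    # Hypothetical folders; adjust to your own layout.
    convert_all("pdfs", "html_out")
```

Running one file per process is slower than passing many files at once, but a failure then points at a single PDF.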

Author: therobyouknow

I enjoy making software and applying technology to help myself and friends and family achieve things as well as earning a living doing it. github.com/therobyouknow linkedin.com/in/therobyouknow twitter.com/therobyouknow

Updated on September 17, 2022

Comments

  • therobyouknow
    therobyouknow over 1 year

    I'm looking for either a free or paid-for (about $50/£40) BATCH PDF to HTML converter to convert several PDF files at once.

    It needs to be able to handle vector and bitmap images within the file, outputting both as JPEGs referenced by the HTML pages.

    I've tried iorigsoft's paid-for PDF to HTML converter. It seems to hang or just go idle, and the output it does produce has broken links: the wrong name is used for the constituent chapter HTML files.

    I also tried the application from intrapdf.com, but it consistently crashes near the beginning of the conversion.

    Update:

    intrapdf works on my Windows XP machine but not on my Windows 7 machine. The only glitch is with the framed index contents HTML: the graphics do not display inside the frame, but if you open the frame on its own in a new tab you can see them. That might be a browser glitch in Chrome only.

    This solution is good enough for me, given that I've already spent the money (I had spent it before I asked), but I can't accept my own answer as it does not work on Windows 7.

    I looked at open-source tools, but they look equally flakey or only support old PDF versions.

    Need it on Windows 7 32bit home.

    Thoughts?

    • Joel Coehoorn
      Joel Coehoorn about 13 years
      Just to warn you: "HTML" and "accurate" don't often belong in the same sentence.
    • 100rabh
      100rabh about 13 years
      If none of our solutions worked, you could post the one you used and mark it as an answer :)
  • Randolf Richardson
    Randolf Richardson about 13 years
    Additionally, PDFs may also contain portions of fonts, and replicating the font size on a web browser that's running on a computer that may not have these fonts installed is not going to result in the same appearance unless they are rendered to graphic images ahead-of-time.
  • therobyouknow
    therobyouknow about 13 years
    +1 worth checking, there is a python based command line suite for using Open Office to convert between formats.
  • therobyouknow
    therobyouknow about 13 years
    +1 looks good though they don't specify which Windows platforms prominently and the screenshots are XP but one can assume it works also for Windows 7. Also, they don't say if they deal with encrypted password protected documents, though I confess I didn't originally ask for this.
  • therobyouknow
    therobyouknow about 13 years
    +1 Looks good, it has got good ratings, deals with passwords and runs on Windows 7.
  • therobyouknow
    therobyouknow over 11 years
    Prefer something standalone on my machine that I have complete control of. With an online service there's a chance they retain the data.
  • Vadzim
    Vadzim over 11 years
    It's downloadable, not online.
  • Vadzim
    Vadzim over 11 years
    I'm not affiliated and don't know their business model. But it seems they aim at paid custom development and use free tools for promotion. BTW, I first found it on CNET: download.cnet.com/Free-PDF-to-HTML/3000-10743_4-75732610.html