Convert PDF to Word in Java

Solution 1

Reading PDF documents is an involved process, and there are no good free Java libraries for extracting non-text information from them. Worse, a PDF stores page layout rather than document structure, and that structure is hard to reconstruct: a table in a Word document, for example, becomes a set of ruled lines and scattered pieces of text in the PDF.
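To make the table problem concrete, here is a minimal sketch of the kind of guesswork involved. It assumes some extractor has already handed you the left x-coordinate (in points) of every text fragment on a page; the class name, the method `guessColumnEdges`, and the tolerance value are all hypothetical and do not come from any PDF library. It merely clusters nearby left edges and treats each cluster as a possible column.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ColumnGuessSketch {

    /**
     * Guesses column positions from the left edges of text fragments.
     * Edges are sorted, and an edge starts a new column only if it lies
     * more than `tolerance` points to the right of the current column's edge.
     */
    static List<Double> guessColumnEdges(List<Double> leftEdges, double tolerance) {
        List<Double> sorted = new ArrayList<>(leftEdges);
        Collections.sort(sorted);

        List<Double> columns = new ArrayList<>();
        for (double x : sorted) {
            if (columns.isEmpty() || x - columns.get(columns.size() - 1) > tolerance) {
                columns.add(x); // first edge of a new cluster stands in for the column
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Made-up left edges for three rows of a three-column table.
        List<Double> edges = List.of(72.0, 200.5, 72.3, 330.0, 199.8, 71.9, 330.2);
        System.out.println(guessColumnEdges(edges, 3.0)); // prints [71.9, 199.8, 330.0]
    }
}
```

This works only when cells are left-aligned and columns are well separated. Centred or right-aligned cells, merged cells, and wrapped text all defeat it, which is why the reconstruction is unreliable in general.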

Solution 2

It is almost impossible to recreate semantic information from an arbitrary PDF. If you have the same tool that wrote it, your chances are somewhat better, but even then there is much uncertainty. The only thing you can be sure of in a (text) PDF is the position of each character on the page. (Note that some PDFs contain bitmaps with text in them, and recovering that text has to rely on OCR.)
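As an illustration of what "only character positions" means in practice, the sketch below rebuilds lines and word breaks from a bag of positioned characters. The `Glyph` class, the `toLines` method, and the two thresholds are hypothetical and not taken from any PDF library; it also assumes that y grows upward on the page, so a larger y means a line nearer the top. Even this simplest step rests on heuristics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlyphLineSketch {

    /** Hypothetical positioned character (coordinates in points). */
    static final class Glyph {
        final char ch;
        final double x;      // left edge
        final double y;      // baseline
        final double width;  // advance width

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into lines by baseline (within yTolerance), orders each
     * line left to right, and inserts a space wherever the horizontal gap
     * exceeds gapFactor times the previous glyph's width.
     */
    static List<String> toLines(List<Glyph> glyphs, double yTolerance, double gapFactor) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Top of page first: assumes y increases upward.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - g.y) <= yTolerance) {
                last.add(g);
            } else {
                List<Glyph> fresh = new ArrayList<>();
                fresh.add(g);
                lines.add(fresh);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : line) {
                if (prev != null && g.x - (prev.x + prev.width) > gapFactor * prev.width) {
                    sb.append(' '); // wide gap: guess a word break
                }
                sb.append(g.ch);
                prev = g;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up glyphs, deliberately out of reading order.
        List<Glyph> glyphs = List.of(
                new Glyph('k', 16, 688, 6),
                new Glyph('y', 25, 700, 6),
                new Glyph('H', 10, 700.4, 6),
                new Glyph('o', 31, 700, 6),
                new Glyph('o', 10, 688, 6),
                new Glyph('i', 16, 700, 3));
        System.out.println(toLines(glyphs, 2.0, 0.3)); // prints [Hi yo, ok]
    }
}
```

Everything beyond this (paragraphs, headings, lists, tables) has to be inferred on top of such guesses, and each layer adds its own error.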

There are several groups in computer science departments and elsewhere who are spending very significant effort trying to recover semantic information. We collaborate with Penn State, one of the leaders, and they are working on extracting tables. In good cases they get 90%; in bad ones, 50%.

So the formal answer is that you cannot, but you may occasionally be fortunate. (We do a lot of this for chemistry and count ourselves lucky if we get 50% on a regular basis.)

Author by

user121196

Updated on June 07, 2022

Comments

  • user121196 almost 2 years

    Is it possible to convert PDF to Word in Java? I'm not talking about parsing a PDF document and then custom-rendering it to Word again. I want a Java library that can convert it directly.