Converting DJVU to PDF


Solution 1

Method 1

Simply use DJView and export as PDF

  1. Go to Synaptic Package Manager
  2. Install djview4
  3. Run DjView (Applications - Graphics - DjView4)
  4. Open your .djvu document
  5. Menu - Export As: PDF

Method 2

  1. Open the DjVu file in Evince
  2. Select Print, then Print to File
  3. Change .ps to .pdf and click Print

Method 3

  1. Go to Synaptic Package Manager
  2. Install

    djvulibre-bin libdjvulibre21 okular-extra-backends evince libevdocument3 libevview3

  3. Open a terminal and run

     sudo apt-get install libtiff-tools

  4. Go to the directory that contains the DjVu file, right-click, and choose “Open In Terminal”. A terminal will open.

  5. In that terminal, run

    ddjvu -format=tiff file_name.djvu file_name.tiff
    tiff2pdf -j -o file_name.pdf file_name.tiff
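
The two commands in step 5 can be wrapped in a small shell function so the intermediate TIFF is cleaned up automatically. This is only a sketch: the name `djvu_to_pdf` and the temporary file layout are assumptions made for illustration, and like the commands above it keeps only the page images, not the text layer.

```shell
#!/usr/bin/env bash
# Sketch only: djvu_to_pdf is a hypothetical helper name, not part of DjVuLibre.
# Usage: djvu_to_pdf input.djvu [output.pdf]
djvu_to_pdf() {
    local input="$1"
    local output="${2:-${input%.djvu}.pdf}"
    local tool tmpdir rc

    # Fail early when either tool from Method 3 is missing.
    for tool in ddjvu tiff2pdf; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            return 1
        fi
    done

    # Scratch directory for the intermediate TIFF.
    tmpdir=$(mktemp -d) || return 1

    # Render the pages to TIFF, then wrap them in a PDF using JPEG compression (-j).
    ddjvu -format=tiff "$input" "$tmpdir/pages.tiff" &&
        tiff2pdf -j -o "$output" "$tmpdir/pages.tiff"
    rc=$?

    rm -rf "$tmpdir"
    return "$rc"
}
```

After sourcing the file, `djvu_to_pdf file_name.djvu` would write `file_name.pdf` next to the input.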
    

Method 4

There is also an online DjVu to PDF converter.

Solution 2

Here is one way, which requires some less common tools:

  1. ocrodjvu
  2. pdfbeads, which has its own requirements that can be found via Google

We can use the djvu2hocr command (from the ocrodjvu package) to extract the hidden text layer from a DjVu file. It doesn't do any OCR; it just extracts the text layer with its geometry. For example:

djvu2hocr -p 10 sample.djvu | sed 's/ocrx/ocr/g' > pg10.html

The sed step corrects the class names in the output hOCR (which is just a simple HTML file).

Now we extract the DjVu page to TIFF format with:

ddjvu -format=tiff -page=10 sample.djvu pg10.tif

so that we end up with these files in our work folder:

sample.djvu
pg10.html
pg10.tif

This is where pdfbeads comes into play, and we simply execute:

pdfbeads -o pg10.pdf

This nifty program then takes care of everything inside this folder (HTML and TIFF files with the same base name) and produces the output PDF file along with some by-products:

sample.djvu
pg10.html
pg10.tif
pg10.jbig2
pg10.pdf
pg10.sym

The resulting PDF looks identical to the input DjVu page and has the text layer inside.
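
The three steps above (djvu2hocr, ddjvu, pdfbeads) can be chained for a single page as in the following sketch. The function name `djvu_page_to_pdf`, the scratch directory, and the intermediate name `page.pdf` are assumptions made for illustration. The commands themselves are the ones shown above. Because pdfbeads picks up every matching file in the current folder, the sketch runs it in an otherwise empty directory.

```shell
#!/usr/bin/env bash
# Sketch only: djvu_page_to_pdf is a hypothetical helper name.
# Usage: djvu_page_to_pdf sample.djvu 10 pg10.pdf
djvu_page_to_pdf() {
    local djvu="$1" page="$2" out="$3"
    local workdir base rc

    # Work in an empty scratch directory that holds only this page.
    workdir=$(mktemp -d) || return 1
    base="pg$page"

    # 1. Extract the hidden text layer as hOCR and fix the class names.
    #    (Without pipefail, a djvu2hocr failure is not detected here.)
    # 2. Render the same page to TIFF under the same base name.
    # 3. Let pdfbeads combine both files into a PDF.
    # 4. Move the result to the requested output path.
    djvu2hocr -p "$page" "$djvu" | sed 's/ocrx/ocr/g' > "$workdir/$base.html" &&
        ddjvu -format=tiff -page="$page" "$djvu" "$workdir/$base.tif" &&
        (cd "$workdir" && pdfbeads -o page.pdf) &&
        mv "$workdir/page.pdf" "$out"
    rc=$?

    rm -rf "$workdir"
    return "$rc"
}
```

The by-products (.jbig2, .sym) stay in the scratch directory and are deleted with it, so only the PDF remains.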

Comments summary:

The lengthy comments below discuss representing the smaller images on a DjVu page as separate objects. This is not easily possible, because a DjVu page is itself just a single image with an optional text layer and carries no information about smaller images as separate objects. If a DjVu document has color images, they are usually placed on the background layer. In that case you can use tools like ddjvu (to extract only the background layer) and imagemagick (to auto-crop) to output just the images instead of the whole canvas, but this can't be automated for creating PDF output.
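
As a rough illustration of the ddjvu plus imagemagick idea, the sketch below renders only the background layer of one page and trims the uniform border around it. The function name `extract_page_picture` and the 5% fuzz value are assumptions. It also assumes the ImageMagick 6 command name `convert` (newer releases call it `magick`). It only helps when the page has a single color picture on the background layer, and the result still needs to be checked by eye.

```shell
#!/usr/bin/env bash
# Sketch only: extract_page_picture is a hypothetical helper name.
# Usage: extract_page_picture sample.djvu 10 picture.png
extract_page_picture() {
    local djvu="$1" page="$2" out="$3"
    local tmp rc

    tmp=$(mktemp -d) || return 1

    # Render only the background layer, where color pictures usually live,
    # then trim the near-uniform border (the 5% fuzz is an arbitrary tolerance).
    ddjvu -format=tiff -mode=background -page="$page" "$djvu" "$tmp/bg.tif" &&
        convert "$tmp/bg.tif" -fuzz 5% -trim +repage "$out"
    rc=$?

    rm -rf "$tmp"
    return "$rc"
}
```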

Another saner, but slower, approach is to use regular OCR GUI tools. gscan2pdf (> 1.0) is suggested as a possible candidate on Linux.

Solution 3

There is djvu2pdf, but it relies on Ghostscript, so it might be just another printing option. I still suggest you give it a look, in case it's cleverer than I'm giving it credit for.

It's not in the repos but you can download a deb from the makers' site: http://0x2a.at/s/projects/djvu2pdf

** Insert mandatory notice about downloading/installing things from outside the repos here **

Solution 4

Using DjVuLibre, one can extract the text layer via either of these terminal commands:

djvutxt myfile.djvu > myfile-ocr.txt

djvused myfile.djvu -e 'print-pure-txt' > myfile.txt

(both do the same thing, and were found here)

Formatting requires some effort (as many symbols are not converted properly) and pictures are not recovered.
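
Before choosing a conversion route, it can help to check whether a DjVu file has a text layer at all. The sketch below does this by testing whether djvutxt prints anything other than whitespace. The function name `has_text_layer` is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: has_text_layer is a hypothetical helper name.
# Succeeds if djvutxt prints any non-whitespace characters for the file.
has_text_layer() {
    local text
    # Strip spaces, newlines and the form feeds that separate pages.
    text=$(djvutxt "$1" | tr -d '[:space:]')
    [ -n "$text" ]
}
```

For example, `has_text_layer myfile.djvu && echo "text layer found"` prints the message only when some text was extracted.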

Solution 5

I made a script from @zetah's answer (Solution 2).

It is available here: https://gist.github.com/matthieuheitz/7287e214b1aeda7948f6c27fbfb5288b

Author by

hayd


Updated on September 18, 2022

Comments

  • hayd
    hayd almost 2 years

    I want to convert a DJVU document into a PDF document, separating and preserving the text layer and the images while also keeping the structure from the DJVU. How can I do this in Ubuntu?

    (I will then be using Calibre to convert to ePub/Mobi, so if there were a Calibre plug-in for this entire process that would be perfect for me!)

    Note1: Printing from Evince, exporting from DJview, or anything using the package ddjvu, are not adequate solutions as they discard the text layer, saving only images.

    Note2: Using DJVULibre seems to extract only the text layer; pictures are not extracted. Similarly, copying the text "manually" loses both the document structure and the pictures.

  • hayd
    hayd about 13 years
    I'm afraid djvu2pdf uses ddjvu to export to PDF, which exports images without text.
  • hayd
    hayd about 13 years
    This is good for converting picture-less books in DJVU format, but not for documents with pictures. This is the current solution for me at the moment, and the only one to extract the text. A way to preserve formatting and pictures would be much preferred!
  • hayd
    hayd about 12 years
    @Ashu Are you sure this retrieves the pictures?
  • Ashu
    Ashu about 12 years
    Yeah, methods 1 and 2 did work for me. Didn't try 3 and 4.
  • Ashu
    Ashu about 12 years
    i have already posted that website bro
  • hayd
    hayd about 12 years
    Am I correct in thinking that this doesn't extract the individual picture data, but only the image of the entire page?
  • zetah
    zetah about 12 years
    What do you mean by "individual picture data" when you refer to DjVu file structure?
  • hayd
    hayd about 12 years
    whether it can crop the pictures out of the document as smaller images placed on top of the PDF (e.g. so they could export to HTML)
  • zetah
    zetah about 12 years
    There is no such definition in the DjVu file structure. The example image above, in the original DjVu document, is "placed" on the foreground layer/mask together with the character images, and there is a separate text layer, which was extracted as explained. If a DjVu document has color image(s), they will be placed on the background layer across the whole page (in a common compound DjVu file). While it's understandable that you may expect images in a DjVu document page to be separate objects, they are not. Look at a DjVu document page as a single image with an optional text layer; that's basically what it is.
  • zetah
    zetah about 12 years
    No, he couldn't make proper PDF out of it. It's just suggestion for different approach - make EPUB XHTML out from hOCR, and if images are in color then he could autocrop. All that will depend on XSLT used, in case he goes in that route.
  • hayd
    hayd about 12 years
    @zetah but can't you make proper PDFs using the smaller images (wouldn't this then reduce the size of the PDF?), the autocrop stuff sounds exactly like what I'm looking for!
  • hayd
    hayd about 12 years
    @zetah calibre can convert PDFs with the images cropped really well, but it can't handle where the entire page is just one image. This is really the key/difficult part of the problem! If you think this could be achieved that would be really interesting to me.
  • zetah
    zetah about 12 years
    It's not possible. I think I briefly explained what DjVu is: it's just an image, and it's totally different from PDF. You'll have to consider image DPI, geometry and position, and even if you deduce it, you'll have to use some serious code to make a PDF out of it. As I told you, you don't need to move a whole mountain to Calibre.
  • hayd
    hayd about 12 years
    @zetah I was hoping there might be a tool which does this - so that others had done the "serious code" :). Certainly I agree the cropping of the images is the difficult bit. (but I thought you suggested you could autocrop with 'imagemagick'? if the solution could export to HTML that'd be perfect!)
  • zetah
    zetah about 12 years
    Yes, you can use imagemagick to autocrop image(s) from DjVu image, only if it's in color, but what will you do with it? "Serious code" kind of work is done in OCR GUIs - "Finereader", "Readiris"... ($ Windows) or even "gscan2pdf" (> 1.0) I think offers this feature. Even then it's not fully automatic, as user needs to confirm that image detection is correct. I'm not aware of easy solution that would suit you. It's very specific, as for some reason I can only guess, you are left with DjVu files, which you want to convert in specific PDF format, so that Calibre can convert "properly".
  • Jorge Castro
    Jorge Castro about 12 years
    Ok this seems to be the closest we're going to get, bounty goes to zetah. If you guys can add anything that would be useful to the next person in the answer so it's not buried in the comments it would be swell.
  • hayd
    hayd almost 12 years
    It doesn't (retrieve the images or text).
  • corev
    corev over 11 years
    This seems a fake site. I get this message after conversion: I'm sorry, you may not download that file.
  • Tim
    Tim almost 10 years
    Thanks, @zetah! (1) Is pdfbeads only able to work on a single page tiff? When converting a multipage bundled djvu file to a multipage pdf file, do we have to do what you said on each page separately, and then combine the single-page pdf files together? (2) the bookmarks in the original multipage bundled djvu file will be lost in the multipage pdf file, correct?
  • zetah
    zetah almost 10 years
    @Tim this is old post. (1) hocr can reference multiple pages, but is this implemented in pdfbeads, I don't know so you'll have to try and see. (2) Bookmarks will be lost sure. Possible solution: use bmcconverter to convert djvu bookmarks to pdftk bookmarks, then use this script to convert pdftk bookmarks to pdfmarks, and finally use Ghostscript to write bookmarks in pdfmarks to pdf file example.
  • Tim
    Tim over 9 years
    @zetah: Thanks. I understand those now. Somehow different but related question: If I have a pdf file and a html file in hocr format. Can I merge the hocr file into the pdf file, to make the pdf file searchable, without converting the pdf file to single-page image files? See unix.stackexchange.com/questions/170133/…
  • NGRhodes
    NGRhodes about 9 years
    Hi, to make this a more useful answer, could you give a little more detail about where to obtain and how to use gscan2pdf and tesseract?
  • Alexey
    Alexey over 6 years
    About "Method 2": changing the extension from .ps to .pdf does not change anything, Evince still produces the same Postscript file (tested on Ubuntu 17.10).
  • Alexey
    Alexey almost 5 years
    The text layer is lost with method 1 (i suppose with the others too).
  • matthieu
    matthieu almost 5 years
    Thanks a lot! I made a script from your answer: gist.github.com/matthieuheitz/7287e214b1aeda7948f6c27fbfb5288b
  • rbrito
    rbrito about 4 years
    Very good summary of what is in that post. Thanks for this script!
  • HappyFace
    HappyFace almost 4 years
    brew install djvu2pdf
  • robertspierre
    robertspierre about 3 years
    Here Method 1 generates a PDF file that is 80 times bigger than the original DJVU file and ... empty.