Converting DJVU to PDF
Solution 1
Method 1
Simply use DjView and export as PDF:
- Go to Synaptic Package Manager
- Install djview4
- Run DjView (Applications - Graphics - DjView4)
- Open your .djvu document
- From the menu, choose Export As: PDF
Method 2
Open the djvu file in Evince
Select Print, then Print to File
Choose PDF as the output format (changing only the extension from .ps to .pdf still produces a PostScript file) and click Print
Method 3
- Go to Synaptic Package Manager
- Install
djvulibre-bin libdjvulibre21 okular-extra-backends evince libevdocument3 libevview3
- Go to a terminal and run
sudo apt-get install libtiff-tools
- Go to the directory where the djvu file is located. Right-click, choose the “Open In Terminal” option, and a terminal will open.
- In that terminal, run these two commands, one after the other:
ddjvu -format=tiff file_name.djvu file_name.tiff
tiff2pdf -j -o file_name.pdf file_name.tiff
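The two commands can also be wrapped in a small shell function. This is only a sketch: the function name djvu_to_pdf_via_tiff and the temporary-file handling are assumptions made for illustration, not part of the original answer. As with the manual steps, the resulting PDF contains page images only, without the text layer.

```shell
# Sketch of Method 3 as a reusable POSIX sh function.
# Assumes ddjvu (djvulibre-bin) and tiff2pdf (libtiff-tools) are installed.
# The function name and temp-file handling are illustrative assumptions.
djvu_to_pdf_via_tiff() (
    input=$1
    output=${2:-${input%.djvu}.pdf}   # default: same name with a .pdf suffix
    tmp_tiff=$(mktemp) || exit 1

    # Render the DjVu pages to a TIFF, then wrap the TIFF in a PDF.
    ddjvu -format=tiff "$input" "$tmp_tiff" &&
        tiff2pdf -j -o "$output" "$tmp_tiff"
    status=$?

    rm -f "$tmp_tiff"                 # clean up the intermediate TIFF
    exit "$status"                    # body runs in a subshell, so this only ends the function
)
```

Usage would look like `djvu_to_pdf_via_tiff file_name.djvu`, which writes file_name.pdf next to the input.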
Method 4
There is also an online DjVu to PDF converter.
Solution 2
Here is one way, which requires some less common tools:
We can use the djvu2hocr command (from the ocrodjvu package) to extract the hidden text layer from a DjVu file (it doesn't do any OCR or similar, it just extracts the text layer with geometry), for example:
djvu2hocr -p 10 sample.djvu | sed 's/ocrx/ocr/g' > pg10.html
The sed step corrects the class names in the output hOCR (which is just a simple HTML file).
Now we extract the DjVu page to TIFF format with:
ddjvu -format=tiff -page=10 sample.djvu pg10.tif
so that we end up with these files in our work folder:
sample.djvu
pg10.html
pg10.tif
This is where pdfbeads comes into play, and we simply execute:
pdfbeads -o pg10.pdf
This nifty program then takes care of everything inside this folder (HTML and TIFF files with the same base name) and produces an output PDF file along with some by-products:
sample.djvu
pg10.html
pg10.tif
pg10.jbig2
pg10.pdf
pg10.sym
The resulting pg10.pdf looks identical to the input DjVu page and has the text layer inside.
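The per-page steps can be repeated for every page of a document. The following is a minimal sketch, not the answer author's script: the function name djvu_to_text_pdf, the pgNNNN file naming, and the temporary work folder are assumptions. It also assumes that pdfbeads, run without file arguments, picks up all TIFF/HTML pairs in the current folder as described above, and it uses `djvused -e n` to read the page count.

```shell
# Sketch: build a PDF with a text layer from all pages of a DjVu file.
# Assumes djvused, ddjvu (djvulibre-bin), djvu2hocr (ocrodjvu) and pdfbeads
# are installed. Function name and file naming are illustrative assumptions.
djvu_to_text_pdf() (
    input=$1
    output=$2

    pages=$(djvused "$input" -e n) || exit 1   # "n" prints the number of pages
    workdir=$(mktemp -d) || exit 1

    page=1
    while [ "$page" -le "$pages" ]; do
        name=$(printf 'pg%04d' "$page")        # zero-padded so pages sort in order
        # Hidden text layer as hOCR, with class names corrected by sed
        djvu2hocr -p "$page" "$input" | sed 's/ocrx/ocr/g' > "$workdir/$name.html"
        # Page image with the same base name as the hOCR file
        ddjvu -format=tiff -page="$page" "$input" "$workdir/$name.tif" || exit 1
        page=$((page + 1))
    done

    # Run pdfbeads inside the work folder, then move the result out
    (cd "$workdir" && pdfbeads -o out.pdf) || exit 1
    mv "$workdir/out.pdf" "$output" && rm -rf "$workdir"
)
```

Usage would look like `djvu_to_text_pdf sample.djvu sample.pdf`. If a step fails, the temporary folder is left in place so the intermediate files can be inspected.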
Comments summary:
Lengthy comments below discuss representing smaller images from a DjVu document page as separate objects, which is not easily possible because a DjVu document page is itself just a single image with an optional text layer, with no "information" about smaller images as separate objects. If a DjVu document has color images, they'll usually be placed on the background layer; in this case the user can take advantage of tools like ddjvu (extract only the background layer) and imagemagick (auto-crop) to output just the images instead of the whole canvas, but this can't be automated for creating PDF output.
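As a rough illustration of the background-layer idea, the sketch below renders only the background layer of one page with ddjvu and trims the uniform border with ImageMagick. The function name, the 5% fuzz value, and the use of the `convert` command (newer ImageMagick releases prefer `magick`) are assumptions, and the result still needs to be checked by eye, since trimming only works when the picture sits on a plain background.

```shell
# Sketch: extract the background layer of one page and auto-crop it.
# Assumes ddjvu (djvulibre-bin) and ImageMagick's convert are installed.
# Function name and fuzz percentage are illustrative assumptions.
extract_background_image() (
    input=$1
    page=$2
    output=$3
    tmp=$(mktemp) || exit 1

    # -mode=background renders only the background layer of the page
    ddjvu -format=tiff -mode=background -page="$page" "$input" "$tmp" &&
        # -trim removes the uniform border; +repage resets the canvas offset
        convert "$tmp" -fuzz 5% -trim +repage "$output"
    status=$?

    rm -f "$tmp"
    exit "$status"
)
```

Usage would look like `extract_background_image sample.djvu 10 pg10-image.png`.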
Another saner, but slower, approach is to use regular OCR GUI tools. gscan2pdf (> 1.0) is suggested as a possible candidate for a Linux PC.
Solution 3
There is djvu2pdf but it relies on ghostscript so it might be another printing option. I still suggest you give it a look, just in case it's more clever than I'm giving it credit.
It's not in the repos but you can download a deb from the makers' site: http://0x2a.at/s/projects/djvu2pdf
** Insert mandatory notice about downloading/installing things from outside the repos here **
Solution 4
Using DjVuLibre, one can extract the text layer via the terminal command:
djvutxt myfile.djvu > myfile-ocr.txt
or
djvused myfile.djvu -e 'print-pure-txt' > myfile.txt
(both do the same thing)
Formatting requires some effort (as many symbols are not converted properly) and pictures are not recovered.
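Before extracting, it can help to check whether the file has a text layer at all, and to pull the text of a single page rather than the whole book. The helpers below are a small sketch built on djvutxt; the function names has_text_layer and djvu_page_text are assumptions made for illustration.

```shell
# Sketch: helpers around djvutxt (djvulibre-bin).
# Function names are illustrative assumptions.

# Succeeds when djvutxt prints at least one non-whitespace character,
# i.e. the document carries a hidden text layer.
has_text_layer() {
    [ -n "$(djvutxt "$1" 2>/dev/null | tr -d '[:space:]')" ]
}

# Prints the hidden text of a single page: djvu_page_text FILE PAGE
djvu_page_text() {
    djvutxt -page="$2" "$1"
}
```

Usage would look like `has_text_layer myfile.djvu && djvu_page_text myfile.djvu 10 > page10.txt`.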
Solution 5
I made a script of @zetah's answer.
It is available here: https://gist.github.com/matthieuheitz/7287e214b1aeda7948f6c27fbfb5288b
Updated on September 18, 2022
Comments
-
hayd almost 2 years
I want to convert a DJVU document into a PDF document, separating and preserving the text layer and the images while also keeping the structure from the DJVU. How can I do this in Ubuntu?
(I will then be using Calibre to convert to ePub/Mobi, so if there were a Calibre plug-in for this entire process that would be perfect for me!)
Note1: Printing from Evince, exporting from DJview, or anything using the package ddjvu, are not adequate solutions as they discard the text layer, saving only images.
Note2: Using DJVULibre seems to only extract the text layer; pictures are not extracted. Similarly, copying the text "manually" loses both the document structure and the pictures.
-
Nathaniel M. Beaver over 4 years
For what it's worth, I filed a feature request for ddjvu here: sourceforge.net/p/djvu/feature-requests/98
-
hayd about 13 years
I'm afraid djvu2pdf uses ddjvu to export to PDF, which exports images without text.
-
hayd about 13 years
This is good for converting picture-less books in DJVU format, but not for documents with pictures. This is the current solution for me at the moment, and the only one to extract the text. A way to preserve formatting and pictures would be much preferred!
-
hayd about 12 years
@Ashu Are you sure this retrieves the pictures?
-
Ashu about 12 years
Yeah method 1 and 2 did work for me . didnt try for 3 and .4
-
Ashu about 12 years
i have already posted that website bro
-
hayd about 12 years
Am I correct in thinking that this doesn't extract the individual picture data, but only the image of the entire page?
-
zetah about 12 years
What do you mean by "individual picture data" when you refer to DjVu file structure?
-
hayd about 12 years
whether it can crop the pictures out of the document as smaller images placed on top of the PDF (e.g. so they could export to HTML)
-
zetah about 12 years
There is no such definition in DjVu file structure. Above example image in original DjVu document is "placed" on foreground layer/mask together with characters image and there is separate text layer which was extracted as explained. If DjVu document has color image(s) they will be placed on background layer across whole page (in common compound DjVu file). While it's understandable that you may expect that images in DjVu document page are separate objects they are not - look at DjVU document page as single image with optional text layer, that's basically what is it.
-
zetah about 12 years
No, he couldn't make proper PDF out of it. It's just suggestion for different approach - make EPUB XHTML out from hOCR, and if images are in color then he could autocrop. All that will depend on XSLT used, in case he goes in that route.
-
hayd about 12 years
@zetah but can't you make proper PDFs using the smaller images (wouldn't this then reduce the size of the PDF?), the autocrop stuff sounds exactly like what I'm looking for!
-
hayd about 12 years
@zetah calibre can convert PDFs with the images cropped really well, but it can't handle where the entire page is just one image. This is really the key/difficult part of the problem! If you think this could be achieved that would be really interesting to me.
-
zetah about 12 years
It's not possible. I think I briefly explained what is DjVu - it's just image, and it's totally different then PDF. You'll have to consider image DPI, geometry and position, and even if you deduce it, you'll have to use some serious code to make PDF out of it. As I told you, you don't need to move whole mountain to Calibre.
-
hayd about 12 years
@zetah I was hoping there might be a tool which does this - so that others had done the "serious code" :). Certainly I agree the cropping of the images is the difficult bit. (but I thought you suggested you could autocrop with 'imagemagick'? if the solution could export to HTML that'd be perfect!)
-
zetah about 12 years
Yes, you can use imagemagick to autocrop image(s) from DjVu image, only if it's in color, but what will you do with it? "Serious code" kind of work is done in OCR GUIs - "Finereader", "Readiris"... ($ Windows) or even "gscan2pdf" (> 1.0) I think offers this feature. Even then it's not fully automatic, as user needs to confirm that image detection is correct. I'm not aware of easy solution that would suit you. It's very specific, as for some reason I can only guess, you are left with DjVu files, which you want to convert in specific PDF format, so that Calibre can convert "properly".
-
Jorge Castro about 12 years
Ok this seems to be the closest we're going to get, bounty goes to zetah. If you guys can add anything that would be useful to the next person in the answer so it's not buried in the comments it would be swell.
-
hayd almost 12 years
It doesn't (retrieve the images or text).
-
corev over 11 years
This seems a fake site. I get this message after conversion: I'm sorry, you may not download that file.
-
Tim almost 10 years
Thanks, @zetah! (1) Is pdfbeads only able to work on a single page tiff? When converting a multipage bundled djvu file to a multipage pdf file, do we have to do what you said on each page separately, and then combine the single-page pdf files together? (2) the bookmarks in the original multipage bundled djvu file will be lost in the multipage pdf file, correct?
-
zetah almost 10 years
@Tim this is old post. (1) hocr can reference multiple pages, but is this implemented in pdfbeads, I don't know so you'll have to try and see. (2) Bookmarks will be lost sure. Possible solution: use bmcconverter to convert djvu bookmarks to pdftk bookmarks, then use this script to convert pdftk bookmarks to pdfmarks, and finally use Ghostscript to write bookmarks in pdfmarks to pdf file example.
-
Tim over 9 years
@zetah: Thanks. I understand those now. Somehow different but related question: If I have a pdf file and a html file in hocr format. Can I merge the hocr file into the pdf file, to make the pdf file searchable, without converting the pdf file to single-page image files? See unix.stackexchange.com/questions/170133/…
-
NGRhodes about 9 years
Hi, to make this a more useful anwer could you give a little more detail about where to obtain and use gscan2pdf and tesseract.
-
Alexey over 6 years
About "Method 2": changing the extension from .ps to .pdf does not change anything, Evince still produces the same Postscript file (tested on Ubuntu 17.10).
-
Alexey almost 5 years
The text layer is lost with method 1 (i suppose with the others too).
-
matthieu almost 5 years
Thanks a lot ! I made a script from your answer : gist.github.com/matthieuheitz/7287e214b1aeda7948f6c27fbfb5288b
-
rbrito about 4 years
Very good summary of what is in that post. Thanks for this script!
-
HappyFace almost 4 years
brew install djvu2pdf
-
robertspierre about 3 years
Here Method 1 generates a PDF file that is 80 times bigger than the original DJVU file and ... empty.