How to convert a webpage to PDF while preserving its look (exactly as in the web browser) and its text/links?


Solution 1

We faced the same problem in a university project and solved it using wkhtmltopdf.

We quite liked this tool on the command line, and we also called it from Python code to render the current state of webpages. It can deliver the webpage as a PDF, which usually does not preserve the website view perfectly because of the page format (A4, for example), or as a PNG, which preserves the view of the page but not the links.
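The answer mentions calling wkhtmltopdf from Python without showing how. A minimal sketch of one way this might look, assuming the `wkhtmltopdf` and `wkhtmltoimage` executables are installed and on PATH. The function names `build_wkhtml_command` and `render` are hypothetical, chosen for illustration, and this is not the answerer's actual code:

```python
import shutil
import subprocess


def build_wkhtml_command(url, output_path, as_image=False, page_size="A4"):
    """Return the argument list for wkhtmltopdf, or wkhtmltoimage for PNG output."""
    if as_image:
        # wkhtmltoimage ships alongside wkhtmltopdf; a PNG keeps the layout
        # but loses selectable text and links.
        return ["wkhtmltoimage", "--format", "png", url, output_path]
    # PDF output keeps text and links but is paginated to the given page size.
    return ["wkhtmltopdf", "--page-size", page_size, url, output_path]


def render(url, output_path, as_image=False):
    """Run the tool if it is installed; raise a clear error otherwise."""
    command = build_wkhtml_command(url, output_path, as_image=as_image)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(command, check=True)
```

For example, `render("https://example.com", "page.pdf")` would write a paginated PDF, and `render("https://example.com", "page.png", as_image=True)` a PNG snapshot of the same page.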

There is also the readability project (for Python: pypi.python.org/pypi/readability-lxml), which we used and which handles ad removal and content detection quite well (e.g. for newspaper articles and the like). If you just want an addon or extension for your browser, the following readability implementation might meet your needs:

Offline now: https://www.readability.com/addons/

WaybackMachine Link: https://web.archive.org/web/20160308192045/https://readability.com/addons

Solution 2

I'm contributing another answer for other users. In Firefox, there used to be an addon called "Print pages to PDF". You can search for its last version, 0.1.9.3 (works on pre-Quantum versions only).

Currently there's this addon for both Chrome and Firefox that works quite well: PDFMage

  • Saves all images in the page
  • Generates text as text, not as an image, so you can search the text in the generated PDF
  • Preserves hyperlinks
  • Has the option to save a long webpage as a one-page PDF (so the images are not split between pages)

Solution 3

I really struggled with this and tried most of the tools mentioned so far. The best results I got were from Chrome's headless mode. The command on macOS looks like this:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --print-to-pdf=test.pdf http://127.0.0.1:8080

The best list of command line options I found was here.

However, there were problems with that. Specifically, my pages are very JavaScript-heavy, and I couldn't make the print function wait for the scripts to finish executing, so my output didn't have the images in it.

The solution I found was a Node.js package: chrome-headless-render-pdf. Its scant documentation is here. It works, and it is easily scriptable.
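The plain headless command can also be driven from a script. A minimal sketch, assuming the macOS Chrome path from the command above. The helper names `build_chrome_pdf_command` and `print_to_pdf` are hypothetical. The optional `--virtual-time-budget` switch is a headless Chrome flag that may give script-heavy pages more time before printing; whether it fixes the missing-images problem described here is an assumption that would need testing on the pages in question:

```python
import subprocess

# Path taken from the macOS command above; adjust it for other systems.
CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def build_chrome_pdf_command(url, output_path, wait_ms=None, chrome=CHROME):
    """Return the argument list for printing a URL to PDF with headless Chrome."""
    command = [chrome, "--headless", f"--print-to-pdf={output_path}"]
    if wait_ms is not None:
        # Assumption: a virtual time budget lets page scripts run for up to
        # wait_ms milliseconds of virtual time before the PDF is produced.
        command.append(f"--virtual-time-budget={wait_ms}")
    command.append(url)
    return command


def print_to_pdf(url, output_path, wait_ms=None):
    """Run headless Chrome; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_chrome_pdf_command(url, output_path, wait_ms), check=True)
```

Passing the arguments as a list avoids the shell escaping of the spaces in the application path that the one-line command needs.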

Solution 4

I had the same problem, and figured it out via Chrome and with a free printer driver called PDF995. This is part of a suite of PDF utilities; the publisher's web site is http://www.pdf995.com/.

However, I think any web browser and any pdf converter will suffice. Anyway, here's what I did:

  1. Select all, or highlight everything you want.
  2. Right-click the highlighted selection or press Ctrl+P (the two options take slightly different routes, but the outcome is the same).
  3. If you right-clicked in step 2 (the shortcut), click "Print"; only what you selected will appear in the print preview. Make sure you change the printer destination to whatever PDF converter you decide to use (PDF995 or other).
  4. Click "Print" and it saves as a PDF document.
  5. If you pressed Ctrl+P in step 2 instead (the slightly longer way), click "More settings" and scroll down to "Options".
  6. Tick the box labelled "Selection only"; from there, everything proceeds as in the shortcut described above.
  7. Don't forget to change the printer destination to whatever PDF converter you choose (PDF995 or other).
  8. Click "Print".

Solution 5

If you're on Linux, try the small command-line tool CutyCapt, which depends only on Qt and QtWebKit and can export to PDF.
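A CutyCapt call can be scripted in the same way as the other tools. This is a minimal sketch using CutyCapt's `--url`, `--out`, `--delay` and `--zoom-factor` options. The helper name `build_cutycapt_command` is hypothetical, and the executable name is an assumption: it is `CutyCapt` when built from source, but some distributions install it in lowercase.

```python
import subprocess


def build_cutycapt_command(url, output_path, delay_ms=None, zoom_factor=None,
                           executable="CutyCapt"):
    """Return the argument list for a CutyCapt capture.

    CutyCapt picks the output format from the file extension, so an
    output_path ending in .pdf produces a PDF.
    """
    command = [executable, f"--url={url}", f"--out={output_path}"]
    if delay_ms is not None:
        # Wait this many milliseconds after the page loads before capturing.
        command.append(f"--delay={delay_ms}")
    if zoom_factor is not None:
        # Scale the page, e.g. 0.4 to fit a wide layout onto the page.
        command.append(f"--zoom-factor={zoom_factor}")
    return command


def capture(url, output_path, **options):
    """Run CutyCapt; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_cutycapt_command(url, output_path, **options), check=True)
```

On a machine without a display, CutyCapt typically needs a virtual X server such as Xvfb to run.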

Author by

Omar

Updated on September 18, 2022

Comments

  • Omar
    Omar over 1 year

    I'm looking for a way to convert a webpage to PDF while preserving the webpage's look and keeping its text selectable and searchable. (Generating an image screenshot of the webpage would make the text neither selectable nor searchable.)

    I want to print the webpage to PDF as is (as shown in the web browser), without any changes to style or alignment and without losing any of the webpage's static components.

    This would help preserve offline copies of webpages that are easily readable, annotatable, and searchable.


    You don't need to read anything below to understand my question (the question is just the section above). The following section only lists what I've found through my own research or from others' answers, in a nested way, on the way to an answer.

    Research Outcomes (Suggestions that didn't solve my problem)

    Outcomes so far from trying to find a solution (none of them solves this question yet)

    I've tried these PDF web printing engines, but all of them alter the pages' look, and some even damage pages to the point of being hardly readable. (Example page screenshots are included in square brackets.)

    • Chrome [Original, Print Styles (Disabled | not Disabled)]
    • Firefox [Original, Print Styles (Disabled p1,p2 | not Disabled p1,p2)]
    • Readability
      • It simplifies the webpage, which is good for focused reading but isn't what I'm looking for. I want to keep all of the webpage's position/style properties as seen in the web browser, in PDF format, without any manipulation.
    • Foxit Reader
    • NovaPDF
    • CutyCapt [Original, Zoom Factor: 0.4: Screenshots, Outputted PDF]
      • I'll add links after I solve the program's running issues on Windows.
    • wkhtmltopdf [Original, Zoom Factor: 0.4: Screenshots, Outputted PDF]
      • It doesn't support CSS3.

    All webpage screenshot image capturing plugins (e.g. Abduction, Awesome Screenshot, Fireshot, Firefox Screenshot Developer Tool, Full Page Screen Capture, Page2Images, web-capture, ...) don't answer my question, because they don't preserve text and links.

    Scrible is great at preserving webpages as is for further annotation and research, but unfortunately it works online only and doesn't convert to PDF format.

    There are two other questions in the community that are somewhat similar to mine; this one differs slightly, but with important distinctions:

    More Similar questions where preserving text and links isn't a requirement (pages are captured as image screenshots mostly):


    Notes

    OS: Windows 10

  • Omar
    Omar about 8 years
    Saving the webpage in .html format would make it not-annotateable. So, I need it in PDF format.
  • Pyheme
    Pyheme about 8 years
    That's a good point! I just remembered an extension that lets you easily disable print-related stylesheets. A quick Google search led me to the discussion where I had first heard of it, on Superuser: How to get WYSIWYP (print what you see) in a web browser?
  • Omar
    Omar about 8 years
    Unfortunately, wkhtmltopdf didn't preserve the positions of the page's elements. Example Page: Zoom Factor: 0.4: Screenshots, Outputted PDF
  • Omar
    Omar about 8 years
    Readability simplifies the page, which is a good thing but isn't what I'm looking for. I need to keep all of the page's position/style properties as seen in the web browser, in PDF format, without any manipulation.
  • sebisnow
    sebisnow about 8 years
    Did you use the tool's PNG output (wkhtmltoimage)? As a PNG, the positions should be okay (at least much better than in the PDF version, where the page is fitted to A4 format).
  • fixer1234
    fixer1234 over 7 years
    This preserves links, but not selectable text, which is a requirement in the question.
  • David Herse
    David Herse over 7 years
    Seems to be selectable for some sites. I think it depends what sort of custom font the site uses.
  • SherlockSpreadsheets
    SherlockSpreadsheets over 5 years
    I tried doing "Save As" using Chrome. It creates an .HTML file and a folder. The .HTML file was missing a whole lot of stuff from the page.
  • jeppoo1
    jeppoo1 about 4 years
    @sebisnow Is the readability.com site deprecated? I can't access it at the moment.
  • sebisnow
    sebisnow about 4 years
    yes, seems to be offline for at least a year already. I will add a wayback machine link. web.archive.org/web/20160308192045/https://readability.com/…
  • PS Nayak
    PS Nayak almost 4 years
    The link does not work. You should remove this answer.
  • Finn Årup Nielsen
    Finn Årup Nielsen about 3 years
    wkhtmltopdf does not seem to handle iframes
  • sebisnow
    sebisnow about 3 years
    There is an open (quite old) issue: github.com/wkhtmltopdf/wkhtmltopdf/issues/2010. In a related issue, they mention a workaround: explicitly setting the width and height of the iframe so that it renders correctly. github.com/wkhtmltopdf/wkhtmltopdf/issues/1685
  • Dude named Ben
    Dude named Ben almost 3 years
    Excellent addon. Thank you.
  • Dude named Ben
    Dude named Ben almost 3 years
    Headless chrome works but generates horrible output.
  • Ricardo
    Ricardo about 2 years
    wkhtmltopdf does NOT work properly. I am a Browser War Veteran (remember 2006-09?) and that little piece of... tool gives me flashbacks. It will NOT understand page breaks, it will NOT print table gridlines thinner than 1mm (🤮) and it will NOT balance table line heights, keeping a fixed height and then dumping the remaining height on the last line. It is only useful if you go back to 1994 and print from NCSA Mosaic. I'm trying to use Selenium, headless browser and print to PDF. My solution will appear here if I ever make it run. I'm throwing in the kitchen sink and even pandoc.
  • Admin
    Admin about 2 years
    @Ricardo I remember those days well. The last release does much better with escape characters, but as I said in my response you do need some pre-processing as wkhtmltopdf doesn't recognize newer/custom DOM elements. What I've had to do in the headless situation is have a script that modifies the HTML file (removes header and all non-necessary elements, replacing with common ones.) It is a very similar script to what browsers use for 'reading mode', which in my experience prints absolutely fine with wkhtmltopdf. Let me see if I can find the script that's been working for me to add.