What's the best "file format" for saving complete web pages (images, etc.) in a single archive?

html standards webpage archive

23,352

Solution 1

My favourite is the ZIP format, because:

It is very well suited to the purpose
It is well documented
There are a lot of implementations available for creating or reading ZIP files
A user can easily extract single files, change them and put them back in the archive
Almost every major operating system (Windows, Mac and most Linux distributions) has a ZIP program built in

The alternatives all have some flaw:

With MHTML, you cannot easily edit the contents.
With data URIs, I don't know how difficult the implementation would be. (With ZIP, even I could do it in PHP, 3 years ago...)
Storing things as separate files leaves far too many things that could go wrong and mess up your archive.
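
To give an idea of how little code the ZIP route needs, here is a minimal sketch in Python (not the PHP version mentioned above). It assumes the page has already been saved to a folder; `pack_page` and its parameters are hypothetical names chosen for illustration.

```python
import zipfile
from pathlib import Path

def pack_page(page_dir, archive_path):
    """Pack every file under page_dir into one ZIP, keeping relative paths.

    Assumption: archive_path lies outside page_dir, so the archive
    does not end up trying to include itself.
    """
    page_dir = Path(page_dir)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(page_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the page folder, with forward slashes.
                zf.write(path, arcname=path.relative_to(page_dir).as_posix())
    return archive_path
```

Because the result is an ordinary ZIP, any standard archive tool can still open it and swap out single files.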

Solution 2

It is not only a question of file format. Another crucial question is what exactly you want to store. Do you want to:

store the whole page as it is, with all referenced resources (images, CSS and JavaScript)?
capture the page as it was rendered at some point in time, that is, a static snapshot of some rendered state of the page's DOM?

Most current "save page as" functionality in browsers, be it to MAF, MHTML or file+dir, attempts the first way. This is an ultimately flawed approach.

Don't forget that web pages these days are more like local applications than static documents you can easily store. Potential issues:

One page is in fact several pages built dynamically by JS; user interaction is needed to get it to the desired state.
AJAX applications communicate with remote services, which renders them unusable for offline viewing.
Links can be hidden in JavaScript code. Such resources are then not part of the stored page. Even parsing the JS code may not discover them; you need to run the code.
Even the positions of basic HTML elements may be computed dynamically by JS, and it is not always possible or easy to recreate them locally.
You would need some sort of JS memory dump, and a way to load it, to get the page back to the state you hoped to store.

And many many more issues...

Check out the SingleFile extension for Chrome. It stores a web page in one HTML file, with images inlined using the data URIs already mentioned. I haven't tested it much, so I cannot say how well it handles "volatile" AJAX pages.
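
For readers unfamiliar with data URIs, the inlining idea can be sketched roughly as follows. This is not how SingleFile itself is implemented; `to_data_uri` is a hypothetical helper, and the MIME type is simply guessed from the file name.

```python
import base64
import mimetypes

def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw file bytes as a base64 data URI.

    The MIME type is guessed from the file extension; unknown
    extensions fall back to a generic binary type.
    """
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The returned string would then replace the original URL in an `src` or `href` attribute, so the resource travels inside the HTML file itself.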

Solution 3

PDFs are supported on nearly all browsers on nearly all platforms and store content and images in a single file. They can be edited with the right tools. This is almost definitely not ideal, but it's an option to consider.

Solution 4

Use a zip file.

You could always make a program/script that extracts the zip file to a temp directory and loads the index.html file in your browser. You could even use an index.ini/txt file to specify the file that should be loaded when extracting.
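
A rough sketch of such a script in Python, under a few assumptions: the archive is trusted, the optional manifest is a one-line text file called `index.txt` naming the entry page, and `extract_page` is a hypothetical name. None of these conventions come from an existing format.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_page(archive_path, manifest_name="index.txt", default_entry="index.html"):
    """Extract the archive to a fresh temp directory and return the entry page path.

    If the archive contains a manifest file, its single line names the
    entry page; otherwise default_entry is used.
    """
    target = Path(tempfile.mkdtemp(prefix="webpage_"))
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
    manifest = target / manifest_name
    if manifest.is_file():
        entry = manifest.read_text(encoding="utf-8").strip()
    else:
        entry = default_entry
    entry_path = target / entry
    if not entry_path.is_file():
        raise FileNotFoundError(f"entry page {entry!r} not found in archive")
    return entry_path
```

To load the page, the returned path can be handed to the browser, for example with `webbrowser.open(extract_page("page.zip").as_uri())`. Cleaning up the temp directory afterwards is left out of this sketch.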

Basically, you want something like the Mozilla Archive Format, but without the unnecessary RDF crap just to specify what file to load.

MHT files are good, but they usually use base64 to embed files, which makes the file bigger than it should be (data URIs have the same problem). You can add attachments as binary, but you'll have to do that manually with a hex editor or create a tool for it, and client support for it might not be as good.
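
To put a number on the base64 overhead: the encoding emits 4 output characters for every 3 input bytes (padding the last group), so the payload grows by roughly a third. A small Python check, with `base64_size` as a hypothetical helper name:

```python
import base64

def base64_size(n_bytes: int) -> int:
    """Length of the base64 encoding of n_bytes bytes, without line breaks."""
    return 4 * ((n_bytes + 2) // 3)

# A 300,000-byte image becomes 400,000 characters once encoded.
image = bytes(300_000)
assert len(base64.b64encode(image)) == base64_size(300_000) == 400_000
```

MIME bodies additionally wrap base64 lines at 76 characters, so an MHT file ends up slightly larger still.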

Of course, if you want to use what browsers generate, MHT (Opera and IE at least) might be better.

Solution 5

I see no excuse to use anything other than a zip file.


Author by

Admin

Updated on October 21, 2020

Comments

Admin over 3 years
I'm working on a project which stores single images and text files in one place, like a time capsule. Now, most every project can be saved as one file, like DOC, PPT, and ODF. But complete web pages can't -- they're saved as a separate HTML file and data folder. I want to save a web page in a single archive, and while there are several solutions, there's no "standard". Which is the best format for HTML archives?
- Microsoft has MHTML -- basically a file encoded exactly as a MIME HTML email message. It's already based on an existing standard, and MHTML on its own was proposed as RFC 2557. This is a great idea and it's been around forever, except it's been a "proposed standard" since 1999. Plus, implementations other than IE's are just cumbersome. IE and Opera support it; Firefox and Safari only with a cumbersome extension.
- Mozilla has Mozilla Archive Format -- basically a ZIP file with the markup and images, with metadata saved as RDF. It's an awesome idea -- Winamp does this for skins, and ODF and OOXML for their embedded images. I love this, except, 1. Nobody except Mozilla uses it, 2. The only extension supporting it hasn't been updated since Firefox 1.5.
- Data URIs are becoming more popular. Instead of referencing an external location a la MHTML or MAF, you encode the file straight into the HTML markup as base64. Depending on your view, it's streamlined since the files are right where the markup is. However, support is still somewhat weak. Firefox, Opera, and Safari support it without gaffes; IE, the market leader, only started supporting it at IE8, and even then with limits.
- Then of course, there's "Save complete webpage" where the HTML markup is saved as "savedpage.html" and the files in a separate "savedpage_files" folder. Afaik, everyone does this. It's well supported. But having to handle two separate elements is not simple and streamlined at all. My project needs to have them in a single archive.
Keeping in mind browser support and ease of editing the page, what do you think's the best way to save web pages in a single archive? What would be best as a "standard"? Or should I just buckle down and deal with the HTML file and separate folder? For the sake of my project, I could support that, but I'd best avoid it.
Admin over 15 years

DUH! Why didn't I think of that? Yeah, PDF is used by everyone and their mother to share documents. It's not easy to edit without tools, but what's more important is the browser support. 'Specially if I coupled PDF with another solution, it turns out ideal. Thanks!
Admin over 15 years

A very creative answer. You're very right in using a ZIP file and then extracting to a temp dir for my project. I might end up doing that. Good advice on the other formats as well. Thanks!
Admin over 15 years

Excellent advice, these suggestions point me in the right direction. Thanks!
UnkwnTech over 15 years

Depending on the implementation, you may not even have to extract it to a temp directory. I know that in PHP I can read the contents of a ZIP directly, on the fly, so I would not have to extract to a temp file. However, this will increase CPU load a bit.
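
The same on-the-fly reading works in other languages too. A minimal Python sketch (the comment above refers to PHP; `read_member` is a hypothetical helper name):

```python
import zipfile

def read_member(archive_path, member):
    """Return the bytes of one file inside the ZIP without extracting to disk."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(member)
```

Each call decompresses the requested member again, which is where the extra CPU load comes from.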
cavalcade over 8 years

Just curious, by ZIP did you mean standalone ZIP, or Mozilla Archive Format based on ZIP?