Archiving the Web

July 16, 2012 at 4:51 pm | Posted in Internet, Media, Online services, Software

If you’re a researcher or writer, you’ve probably run into the problem of a web reference vanishing. Some sites treat articles as temporary. It’s one of the reasons citation standards for web references include the date you accessed the page. Still, digital information is in many ways superior to paper. The key is capturing that passing data properly.

Tools like Evernote for capturing snippets of the web have been around for years. Evernote captures content along with its links, gathering the bits in a common scroll.

I found I liked to capture the context as well, so I tended to save the whole page or article. Most web browsers let you do this with File/Save Page As or File/Save As. The browser will typically use the page’s title tag as the filename. However, what you end up with is an HTML file and a name-matched folder of parts: the images, CSS, and so forth called by the page. Because Windows sorts folders and files separately, it’s a little too easy for the two to get separated. If you move the file and not the folder, rename the folder, or rename the file (thus visually disassociating it from the folder), the link between them is easily broken. Long titles can create problems for backup and burning software too.
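One way to avoid separating the pair is to always move them together. The following is only a minimal sketch: the function name `move_saved_page` is hypothetical, and it assumes the browser names the companion folder after the HTML file with a `_files` suffix (for example `Some Article.html` and `Some Article_files`), which is a common convention but may differ in your browser.

```python
import shutil
from pathlib import Path


def move_saved_page(html_file, dest_dir, suffix="_files"):
    """Move a saved page and its companion folder to dest_dir together.

    Assumption: the folder of parts sits next to the HTML file and is
    named "<page name><suffix>". Adjust `suffix` if your browser differs.
    Returns the list of new paths.
    """
    html_file = Path(html_file)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Work out the companion folder name before anything is moved.
    parts_dir = html_file.with_name(html_file.stem + suffix)

    moved = [Path(shutil.move(str(html_file), str(dest_dir / html_file.name)))]
    if parts_dir.is_dir():
        moved.append(
            Path(shutil.move(str(parts_dir), str(dest_dir / parts_dir.name)))
        )
    return moved
```

Calling `move_saved_page("Some Article.html", "Archive")` would then carry the folder along with the file, so the relative links inside the page keep working.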

Web Archive formats to the rescue. The first I discovered years ago was the .MHT format. This is the format Microsoft uses in IE to archive the page and its parts in a single file. It is based on an open standard (surprise!), MHTML, which packs everything into one MIME-encoded file, much like an email with attachments. It opens normally in IE on a double click. To use it, just choose Web Archive from the Save As options in IE and voilà, your page is packed in a single file.
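As described in the MHTML specification (RFC 2557), an .mht file is a MIME multipart message, so a general-purpose MIME parser can look inside it. Here is a small sketch using Python’s standard `email` module to list what a saved page contains. The function name `list_mht_parts` is my own, and real-world files may carry extra headers or nesting that this does not handle specially.

```python
import email


def list_mht_parts(mht_bytes):
    """Return (content type, Content-Location) for each part of an MHT file.

    `mht_bytes` is the raw content of the file, e.g. from
    open("page.mht", "rb").read(). Container parts are skipped so only
    the actual page, images, stylesheets, etc. are listed.
    """
    message = email.message_from_bytes(mht_bytes)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # a wrapper, not a resource
        parts.append((part.get_content_type(), part.get("Content-Location")))
    return parts
```

The Content-Location header, where present, records the original address of each piece, which is handy when you need to cite where an image or page came from.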

However, as I used Firefox for most browsing, this was inconvenient. I happily discovered someone had built a plug-in for saving pages to MHT in Firefox. I used it for years, even editing it after it ceased being updated to keep it working.

Eventually, I came to print pages to PDF. With the right Add-on, this is as simple as File, Print to PDF. (PDFs are “printed” because the add-on uses a PostScript print engine to convert the page to a PDF file.)

There are a number of related tools like Print Edit, which allows you to delete unnecessary parts before saving or printing.

(As an aside, for PDFs I’ve become a big fan of the free Foxit Reader. It’s way faster than Acrobat Reader and, in my experience, hasn’t had the same security issues.)

However, PDFs are designed for office documents, books and such. Some web pages don’t sit very well in a paper layout, and features like embedded media can break.

Recently, I discovered the new Add-on Mozilla Archive Format. It allows you to save in both the MHT format mentioned above (and open those files in Firefox) and in the Mozilla Archive Format (MAFF). For MAFF, the saved page is “faithful” to the original. The format is compressed so it uses less disk space, can include embedded video and audio, is an open standard, and uses the universal ZIP format. It even allows you to save multiple tabs in a single file. For either MHT or MAFF, it displays where you saved the page from and when; perfect for references. And you can convert to other formats. Because a MAFF file is a ZIP file, if you want to extract just one part, like an image, you can use a ZIP tool like the free 7-Zip.
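Since a MAFF file is an ordinary ZIP archive, the same extraction can also be scripted. This is a rough sketch with Python’s standard `zipfile` module, not anything the add-on itself provides. The function name `extract_images` and the list of image extensions are my own choices for illustration.

```python
import zipfile
from pathlib import Path

# Assumption: these extensions cover the images you care about.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def extract_images(archive_path, dest_dir):
    """Extract only the image files from a ZIP-based archive such as .maff.

    Returns the paths of the extracted files. The folder layout inside
    the archive is preserved under dest_dir.
    """
    extracted = []
    with zipfile.ZipFile(str(archive_path)) as archive:
        for name in archive.namelist():
            if Path(name).suffix.lower() in IMAGE_SUFFIXES:
                extracted.append(archive.extract(name, str(dest_dir)))
    return extracted
```

For example, `extract_images("saved page.maff", "images")` would copy out just the pictures and leave the archive itself untouched.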

Of course, no discussion of “web archives” would be complete without mentioning the Internet Archive. It’s a massive site that is archiving portions of the web. The Wayback Machine, for example, will show you old versions of many web sites. They’re also archiving many old texts, live music, and so much more. What they have is a little random but some is amazing.

May your own archive become quite the treasure also.
David

PS – don’t forget to have a Backup of those nice archives.
