Using the Wayback Machine Downloader to rebuild Dissociated Press

This domain has been online since January 2001. A homepage or, more often, some type of blog has been here almost as long. I’ve often been lackadaisical about continuity of content and posting. The kind folks at the Internet Archive, or rather their web scraping bots, have been far more attentive and consistent.

I’ve lost track, but I’ve probably wiped the slate clean and started over seven or eight times since I first registered dissociatedpress.net. Most of the restarts have been intentional, but a few were not. More than a few times I’ve wished to recover things I’d written here (and elsewhere), either to put them back on the web or just to have them as a reminder of how truly bad my writing used to be. (Yes, once upon a time, it was worse.)

Automating downloads from the Wayback Machine

The Wayback Machine is a truly wonderful service courtesy of the Internet Archive. You can find a lot of Internet history via the Wayback Machine, automatically crawled and tucked away to help counter the Web’s tendency towards bit-rot.

Run into a broken link on somebody’s site? Plop it into the “Browse History” slot on the Wayback Machine and check the calendar to find the most relevant or recent backups of the page. (It’s also useful for checking how Web sites have changed over time.)

This is a little tedious, though, if you want to grab all of a site, which I did: I wanted to use some of the Wayback Machine’s resources to rebuild some of Dissociated Press (and other personal sites) from 2001 to present.

The Wayback Machine Downloader is a Gem

For bulk usage, the Wayback Machine Downloader is a real gem. I don’t just mean that like “wow, it’s really neat” (it is), I mean… it’s a Ruby Gem. It’s written in Ruby and you can install it as a gem, assuming you have Ruby 1.9.2 or later on your system (as of this writing). I’m using this on a recent Linux system; it should also work on macOS, I think:

gem install wayback_machine_downloader

After install, if you run wayback_machine_downloader http://mysite.org it will grab the most recent version that’s archived.

Want all the versions? Use wayback_machine_downloader -s http://mysite.org instead. You can use the -d directory option to specify where you want files placed. Note that for sites with a lot of history, this may take quite a while. Grabbing dissociatedpress.net took several hours, while zonker.net only took 30 minutes or so.
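Putting -s and -d together might look something like the sketch below. The directory name is a made-up placeholder, mysite.org stands in for your own domain, and the little run helper is my own addition (not part of the gem) so the script only prints the command unless you ask it to do the real thing.

```shell
#!/bin/sh
# Placeholders: swap in your own domain and target directory.
site="http://mysite.org"
dest="./mysite-archive"

# Print the command instead of running it; set RUN=1 to execute for real.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# -s grabs every archived version, -d picks where the files land.
run wayback_machine_downloader -s -d "$dest" "$site"
```

Run as-is, this just echoes the command so you can eyeball it. Setting RUN=1 in the environment starts the actual download, which is the part that can take hours.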

You may only want captures between specific dates. In that case, the -f timestamp ("from") option limits the download to files captured on or after the timestamp, and the -t timestamp ("to") option limits it to files captured on or before the timestamp. The timestamp is something like 20050825073929 which you can parse out as YYYY MM DD HH MM SS, or in this case 25 August 2005 at 07:39:29. Poking around the Wayback Machine site will let you find specific snapshots and their times so you can pick timestamps for specific captures.

Note that you can truncate a timestamp to just the year, year+month, etc.
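If squinting at fourteen digits gets old, a few lines of shell can split a timestamp up for you. This is only a sketch: wb_ts is a helper name I made up, and the date range in the last line is a hypothetical example that gets printed rather than run.

```shell
#!/bin/sh
# Turn a full 14-digit Wayback Machine timestamp into something readable.
# Anything that is not exactly 14 digits is printed back unchanged.
wb_ts() {
    printf '%s\n' "$1" |
        sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3 \4:\5:\6/'
}

wb_ts 20050825073929   # prints 2005-08-25 07:39:29

# Hypothetical range: captures from 2005 up to that snapshot.
# Printed rather than run, so nothing is downloaded by this sketch.
echo "wayback_machine_downloader -f 2005 -t 20050825073929 http://mysite.org"
```

The -f value there is truncated to just the year, and the -t value is a full timestamp copied from a snapshot.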

The Wayback Machine Downloader also allows grabbing specific URLs or types of files (using the --only option) and excluding URLs (the --exclude option) and more. See the GitHub page or run wayback_machine_downloader -h to see all the options.
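As a rough illustration of those filters, here is a sketch based on my reading of the project's README: a plain string is matched against the URL, while wrapping the filter in slashes makes it a regular expression. The directory name blog and the image pattern are hypothetical, and the run helper is my own addition so the commands are printed instead of executed by default.

```shell
#!/bin/sh
site="http://mysite.org"   # placeholder domain

# Print the command instead of running it; set RUN=1 to execute for real.
# (The printed line drops the shell quoting around the regex.)
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Only fetch URLs that contain a (hypothetical) directory name.
run wayback_machine_downloader "$site" --only blog

# Skip common image types; the /.../ wrapping marks the filter as a regex.
run wayback_machine_downloader "$site" --exclude '/\.(gif|jpg|jpeg)$/i'
```

Check wayback_machine_downloader -h on your installed version before relying on the exact filter syntax, since it may have changed.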
