web (dot) mit (dot) edu

As I’ve spent time looking over portions of the http://www.mit.edu domain, I’ve noticed that some websites are located at web.mit.edu and some at mit.edu. Just based on looks, the web.mit.edu websites seemed to be older, and as sites were redesigned their URLs were updated too. But why was web.mit.edu ever in use? Well, a librarian colleague who has been part of the MIT community for many years helped solve this mystery for me!

The story goes that when the World Wide Web arrived on the scene in the 1990s, the MIT student group SIPB snagged the www.mit.edu URL right away! SIPB, a volunteer student computing group (around since 1969), created a wonderful site that you can view via the Internet Archive (snapshot from 1997).

It’s hard to say if this IA playback of the site is completely accurate in design, but the information is fun to look through (like this timeline – web fever has hit!). Only later did the group hand the www.mit.edu domain over to MIT… thus the mix of web.mit.edu and mit.edu URLs. I don’t know the exact date when MIT started using http://www.mit.edu as the homepage URL (or at least redirecting http://www.mit.edu to web.mit.edu), but in the Wayback Machine the change seems to occur around late 1999 – 2000.

Web history, it’s fun!

While perusing the archived webpages, I noticed that the MIT homepage used to feature some really fun and pretty designs and logos. Sometimes the homepage was designed by someone from the MIT community. This isn’t something the current website does. So glad the IA captured the homepage over the years.


web archiving resources for NDSA NE crew (and anyone else reading this!)

This list of resources is shared as a complement to a presentation I gave at the NDSA New England meeting on September 25, 2015. The presentation discussed the MIT Institute Archives’ efforts to acquire websites without a hosted service. I talked about how technology is important, but policy development and planning are key activities that can be accomplished even if new technology isn’t possible right away. The presentation also highlighted the tools we’re finding useful that are easy for an archivist with limited programming skills to use (WebRecorder, wget, and Web Archive Player). I’ve previously talked about some of these activities on ArchiveHour; see that post here.

P.S. Every time I think I’ve got a handle on the essential web archiving resources, I find out about something new. I’ve also realized that a lot of work went into web archiving development long before I first learned about it in 2013. With this in mind, it’s quite possible that a lot of good stuff is missing from the following list — please add resources you love in the comments or alert me to my ignorance via the contact page. =) thank you!

Get Started

  • International Internet Preservation Consortium (IIPC) website – What is web archiving?
  • IIPC blog post (2015), Ian Milligan – “So You Want to Get Started In Web Archiving?” Provides an excellent list of blogs to follow.
  • Archive-It Web Archiving Life Cycle – the examples are specific to the Archive-It service and its partners, but the life cycle breakdown and concepts are helpful for thinking about the range of activities and policies that go into a web archiving program.
  • DPC Technology Watch Report 13-01 (2013), Maureen Pennock, “Web-Archiving”
  • NDSA 2013 Web Archiving in the United States survey report

Continue reading

reading notes: the tough stuff

This month I chose three readings that are rather different, yet each takes a look at some of the tough stuff that comes up in the information profession — collaboration, digital preservation and web archives, and e-waste and ethical consumerism.

1. The first is a report from OCLC by Jackie Dooley addressing management of born-digital library material. When it comes to navigating born-digital content, digitized materials, digitally published and delivered content, and open web-based content — the best course for acquisition, access, and preservation actions is not always clear or simple.

Continue reading

fellowship update: getting on the same (web) page

“Web Archiving” DCP SPRUCE Digital Preservation Illustrations

Web archiving is a new endeavor for the MIT Institute Archives and Special Collections (IASC) and I am lucky enough to be able to take the lead on developing a website acquisition process for the archives. As with any other initiative, the work involves a lot of collaboration with colleagues. The following sections highlight some of the activities currently in progress for this project…

Outline; or, evolving list(s) of next step activities

I’ve created a loose project outline that is basically an evolving list of activities grouped into some categories (e.g. making the case for web archiving, collaboration and communication, policy and procedures, planning for acquisition and metadata integration, web acquisition tools and services, access). The tasks and categories don’t necessarily represent a strictly linear process, but they help me remember the wide range of elements that go into a thorough web acquisition workflow. (This blog post, for example, fits within communication!)

In order to make the case for web archiving and set groundwork for moving forward, the first task was to write an informational document that defines types of web archiving and explores the IASC’s vision for how web archiving can be part of the digital archives program. The following is an excerpt from the document:

The International Internet Preservation Consortium (IIPC) states that archiving internet content is “the process of collecting portions of the World Wide Web, preserving the collections in an archival format, and then serving the archives for access and use.” In general, the goal of a web archiving program should be to “capture and preserve the dynamic and functional aspects of Web pages – including active links, embedded media, and animation – while also maintaining the context and relationships between files” (Antracoli, et al. 2014). The use of tools, software or hosted services to collect (also referred to as copy, harvest, or crawl) websites is an essential technical step of a web archiving program, but selection and use of a tool(s) shouldn’t be the first or only step of a holistic web archiving effort.

The document goes on to explore some policy and visioning considerations that should ideally come before tool selection or acquisition. One of the most important steps is understanding the purpose of collecting websites. Two common types of web archiving initiatives are general collection/subject area development and archival records collections. For the IASC, the approach to web acquisition falls within the area of archival records. This means we are interested primarily in websites that are records of MIT (mostly, the mit.edu domain). Beyond this initial document describing web archiving, we’ll need to continue to document things like vision and scope, appraisal guidelines (which are, of course, informed by existing collection policy), rights and permissions procedures, workflow and frequency of capture, access for researchers, and preservation. We’re using the Archive-It Web Archiving Life Cycle model to help structure the planning and documentation development.

Appraisal of mit.edu domain; or, understanding the place of website records 

In addition to exploring a web archiving vision for the IASC, I’ve also begun to survey parts of the www.mit.edu domain for sites that are good candidates for acquisition (intellectually, not necessarily technically). In some ways this process is a gap analysis as I consider things like: does this website fit an existing collection? when did the IASC last receive materials from the office or department? does the IASC currently have any digital content for this collection? is this website a replacement for the physical record types that the IASC used to receive? It’s not realistic to expect to cover the entire domain with this method, but I still think it’s worth spending some time appraising (and documenting the appraisal of) a range of websites within the domain. I hope that this process can help us prioritize websites to capture and better understand how websites fit into our collections and finding aids. Throughout this process, I’ve been checking in with Liz, our Archivist for Collections, as her knowledge of the collections and organizational history of the Institute is invaluable!

Technology exploration; or, how this project plan isn’t linear! 

It seems like most U.S. archives and libraries engaged in web archiving are using a hosted service. And that is probably because web acquisition is so complex and difficult to do at scale (e.g. all of mit.edu). MIT Libraries doesn’t currently contract with a web archiving service and we are still exploring options for web acquisition, access and preservation.

But that doesn’t mean we don’t have immediate web archiving needs to address in the meantime! We recently had a request from an Institute office to start archiving a student handbook website. This request pushed us ahead: without all of the planning in place, and deviating slightly from the outline, we found a solution to meet immediate, small-scale needs. While we were considering a few options, WebRecorder.io released a beta version in May. This tool is super easy to use, creates WARCs, and has a partner project for offline WARC playback (Web Archive Player). We are currently using WebRecorder.io Beta for small-scale web acquisition on a selective basis.
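For anyone curious about what’s actually inside the WARC files these tools produce, here’s a minimal, stdlib-only Python sketch that walks the records of an uncompressed WARC and prints each record’s target URI. This is a deliberately simplified illustration, not how Web Archive Player works — real WARCs are usually gzipped per record and have edge cases this skips — and the sample record below is fabricated.

```python
import io

def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream.

    Simplified for illustration: assumes well-formed records and reads the
    body length from the Content-Length named field.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            hline = stream.readline().rstrip(b"\r\n")
            if not hline:
                break  # blank line ends the header block
            name, _, value = hline.partition(b":")
            headers[name.decode().strip()] = value.decode().strip()
        body = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, body

# A tiny fabricated record, just to show the shape of the format.
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 5\r\n"
          b"\r\n"
          b"hello\r\n\r\n")

for headers, body in iter_warc_records(io.BytesIO(record)):
    print(headers["WARC-Target-URI"], len(body))
```

Even a bare-bones reader like this makes the point that a WARC is just a sequence of header-plus-payload records — which is why one capture file can bundle HTML, images, and metadata together.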

I am so very excited about this, and even though this tool is providing a timely solution, we are not abandoning technology exploration, documentation, or policy work! I will be posting more on WebRecorder.io and program documentation on this blog and Engineering the Future of the Past over the next few months.