“Web Archiving” DCP SPRUCE Digital Preservation Illustrations
Web archiving is a new endeavor for the MIT Institute Archives and Special Collections (IASC) and I am lucky enough to be able to take the lead on developing a website acquisition process for the archives. As with any other initiative, the work involves a lot of collaboration with colleagues. The following sections highlight some of the activities currently in progress for this project….
Outline; or, evolving list(s) of next step activities
I’ve created a loose project outline that is basically an evolving list of activities grouped into some categories (e.g. making the case for web archiving, collaboration and communication, policy and procedures, planning for acquisition and metadata integration, web acquisition tools and services, access). The tasks and categories don’t necessarily represent a strictly linear process, but help me remember the wide range of elements that help to create a thorough web acquisition workflow. (This blog post, for example, fits within communication!)
In order to make the case for web archiving and set groundwork for moving forward, the first task was to write an informational document that defines types of web archiving and explores the IASC’s vision for how web archiving can be part of the digital archives program. The following is an excerpt from the document:
The International Internet Preservation Consortium (IIPC) states that archiving internet content is “the process of collecting portions of the World Wide Web, preserving the collections in an archival format, and then serving the archives for access and use.” In general, the goal of a web archiving program should be to “capture and preserve the dynamic and functional aspects of Web pages – including active links, embedded media, and animation – while also maintaining the context and relationships between files” (Antracoli, et al. 2014). The use of tools, software or hosted services to collect (also referred to as copy, harvest, or crawl) websites is an essential technical step of a web archiving program, but selection and use of a tool(s) shouldn’t be the first or only step of a holistic web archiving effort.
The document continues on to explore some policy and visioning considerations that should ideally come before tool selection or acquisition. One of the most important steps is understanding the purpose of collecting websites. Two common types of web archive initiatives include: general collections/subject area development and archival records collections. For the IASC, the approach for web acquisition is within the area of archival records. This means that we are interested primarily in websites that are records of MIT (mostly, the mit.edu domain). Beyond this initial document describing web archiving, we’ll need to continue to document things like vision and scope, appraisal guidelines (which are, of course, informed by existing collection policy), rights and permissions procedures, workflow and frequency of capture, access for researchers, and preservation. We’re using the Archive-It Web Archiving Life cycle model to help structure the planning and documentation development.
Appraisal of mit.edu domain; or, understanding the place of website records
In addition to exploring a web archiving vision for IASC, I’ve also begun to survey parts of the www.mit.edu domain for sites that are good candidates for acquisition (intellectually, not necessarily technically). In some ways this process is a gap analysis as I consider things like: does this website fit an existing collection? when did IASC last received materials from the office or department? does IASC have any digital content for this collection currently? is this website a replacement of the physical record types that IASC used to receive? It’s not realistic to expect to cover the entire domain with this method, but I still think it’s worth it to spend some time appraising (and documenting appraisal) of a range of websites within the domain. I hope that this process can help us prioritize websites to capture and better understand how websites fit into our collections and finding aids. Throughout this process, I’ve been checking in with Liz, our Archivist for Collections, as her knowledge of the collections and organizational history of the Institute is invaluable!
Technology exploration; or, how this project plan isn’t linear!
It seems like most U.S. archives and libraries engaged in web archiving are using a hosted service. And that is probably because web acquisition is so complex and difficult to do at scale (e.g. all of mit.edu). MIT Libraries doesn’t currently contract with a web archiving service and we are still exploring options for web acquisition, access and preservation.
But that doesn’t mean we don’t have immediate web archiving needs to address in the meantime! We recently had a request from an Institute office to start archiving a student handbook website. This request pushed us ahead, so without all the planning in exact place, and deviating slightly from the plan, we found a solution to meet immediate, small-scale needs. After considering a few options, WebRecorder.io released a beta version in May. This tool is super easy to use, creates WARCs, and has a partner project for offline WARC playback (Web Archive Player). We are currently using WebRecorder.io Beta for small-scale web acquisition on a selective basis.
I am so very excited about this and even thought this tool is providing a timely solution – we are not abandoning technology exploration, documentation or policy work! I will be posting more on WebRecorder.io and program documentation on this blog and Engineering the Future of the Past over the next few months.