fellowship update: getting on the same (web) page

“Web Archiving” DCP SPRUCE Digital Preservation Illustrations

Web archiving is a new endeavor for the MIT Institute Archives and Special Collections (IASC) and I am lucky enough to be able to take the lead on developing a website acquisition process for the archives. As with any other initiative, the work involves a lot of collaboration with colleagues. The following sections highlight some of the activities currently in progress for this project….

Outline; or, evolving list(s) of next step activities

I’ve created a loose project outline that is basically an evolving list of activities grouped into some categories (e.g. making the case for web archiving, collaboration and communication, policy and procedures, planning for acquisition and metadata integration, web acquisition tools and services, access). The tasks and categories don’t necessarily represent a strictly linear process, but help me remember the wide range of elements that help to create a thorough web acquisition workflow. (This blog post, for example, fits within communication!)

In order to make the case for web archiving and set groundwork for moving forward, the first task was to write an informational document that defines types of web archiving and explores the IASC’s vision for how web archiving can be part of the digital archives program. The following is an excerpt from the document:

The International Internet Preservation Consortium (IIPC) states that archiving internet content is “the process of collecting portions of the World Wide Web, preserving the collections in an archival format, and then serving the archives for access and use.” In general, the goal of a web archiving program should be to “capture and preserve the dynamic and functional aspects of Web pages – including active links, embedded media, and animation – while also maintaining the context and relationships between files” (Antracoli, et al. 2014). The use of tools, software or hosted services to collect (also referred to as copy, harvest, or crawl) websites is an essential technical step of a web archiving program, but selection and use of a tool(s) shouldn’t be the first or only step of a holistic web archiving effort.

The document goes on to explore some policy and visioning considerations that should ideally come before tool selection or acquisition. One of the most important steps is understanding the purpose of collecting websites. Two common types of web archive initiatives are general collections/subject area development and archival records collections. For the IASC, the approach for web acquisition is within the area of archival records. This means that we are interested primarily in websites that are records of MIT (mostly, the mit.edu domain). Beyond this initial document describing web archiving, we’ll need to continue to document things like vision and scope, appraisal guidelines (which are, of course, informed by existing collection policy), rights and permissions procedures, workflow and frequency of capture, access for researchers, and preservation. We’re using the Archive-It Web Archiving Life Cycle Model to help structure the planning and documentation development.

Appraisal of mit.edu domain; or, understanding the place of website records 

In addition to exploring a web archiving vision for IASC, I’ve also begun to survey parts of the www.mit.edu domain for sites that are good candidates for acquisition (intellectually, not necessarily technically). In some ways this process is a gap analysis as I consider things like: does this website fit an existing collection? when did IASC last receive materials from the office or department? does IASC have any digital content for this collection currently? is this website a replacement of the physical record types that IASC used to receive? It’s not realistic to expect to cover the entire domain with this method, but I still think it’s worth it to spend some time appraising (and documenting the appraisal of) a range of websites within the domain. I hope that this process can help us prioritize websites to capture and better understand how websites fit into our collections and finding aids. Throughout this process, I’ve been checking in with Liz, our Archivist for Collections, as her knowledge of the collections and organizational history of the Institute is invaluable!

Technology exploration; or, how this project plan isn’t linear! 

It seems like most U.S. archives and libraries engaged in web archiving are using a hosted service. And that is probably because web acquisition is so complex and difficult to do at scale (e.g. all of mit.edu). MIT Libraries doesn’t currently contract with a web archiving service and we are still exploring options for web acquisition, access and preservation.

But that doesn’t mean we don’t have immediate web archiving needs to address in the meantime! We recently had a request from an Institute office to start archiving a student handbook website. This request pushed us ahead, so without all the planning exactly in place, and deviating slightly from the plan, we found a solution to meet immediate, small-scale needs. While we were considering a few options, WebRecorder.io released a beta version in May. This tool is super easy to use, creates WARCs, and has a partner project for offline WARC playback (Web Archive Player). We are currently using WebRecorder.io Beta for small-scale web acquisition on a selective basis.

I am so very excited about this, and even though this tool is providing a timely solution, we are not abandoning technology exploration, documentation or policy work! I will be posting more on WebRecorder.io and program documentation on this blog and Engineering the Future of the Past over the next few months.
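For anyone curious about what is inside the WARC files mentioned above, here is a minimal sketch (not part of our actual workflow) that lists the records in an uncompressed WARC stream using only the Python standard library. The helper names `iter_warc_records` and `make_record`, the example.org URLs, and the tiny in-memory sample are all made up for illustration. Real captures are usually gzip-compressed and their response records also carry the HTTP headers, so a dedicated WARC library would be the safer choice for real files.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of the stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says how many payload bytes follow the headers.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(rec_type, uri, body):
    """Build a simplified WARC record (hypothetical helper, for the demo only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two invented records standing in for a small capture.
sample = make_record(
    "response", "http://example.org/handbook/", b"<html>hello</html>"
) + make_record(
    "response", "http://example.org/handbook/policies", b"<html>policies</html>"
)

captured = [
    (headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
    for headers, payload in iter_warc_records(io.BytesIO(sample))
]

for rec_type, uri, size in captured:
    print(rec_type, uri, size)
```

A listing like this is a quick way to check which URLs a small capture actually contains before loading it into a playback tool.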



fellowship update: six months

March flew by and marked the sixth month of my twenty-four month fellowship. An early lesson learned from the Fellowship experience: keeping up with blogging goals ain’t easy! I’ve found that the time I was setting aside for reading has largely become time to coordinate and read for the Archives and Digital Curation Reading Group. All in all not a bad thing, but I’m striving to get back to my original bloggin’ and readin’ plan in May…well, maybe June!

Below is a breakdown of current work and other things I’ve done as part of the MIT Libraries team so far.

MIT seal

MIT Context – Since day one, I’ve been learning about the history of MIT. This has involved looking over reference requests, learning about the interests of visiting researchers, reading the book A Widening Sphere, taking a student-led campus tour, and general poking around in collection information. The Wikipedia event I co-organized also helped me learn about women at MIT. Did you know… MIT was founded in 1861 just days before the Civil War started? Classes didn’t begin until 1865. MIT was originally located in Boston’s Back Bay and moved to Cambridge in 1916. The MIT motto is Mens et Manus (Mind and Hand).

Acquisition and Processing Adventures – I’ve had the opportunity to observe the acquisition process and procedures for transfer and submission of born-digital records. This has included tagging along with Kari to a few Institute offices for records, walking through procedures for submitting born-digital files for ingest processing with a collections archivist, reworking the written procedures for submission, and thinking generally about PAIMAS concepts in practice. I’ve also started processing a recently acquired born-digital collection.

Research and Fellow Project – In addition to general work duties, the Library Fellows program gives the fellows an opportunity to create and present on a capstone-like project. My project isn’t set in stone yet, but it may be in the area of web archiving. I’ve been spending a good deal of time researching methods and modes of web archiving, considering the various MIT-owned websites that might interest the Institute Archives, and thinking about web archiving experiments we might be able to test in the DSLab. I’m also working on an internal documentation-related project with my fellow Fellow, Christine Malinowski. As that develops, I hope to post about our work.

i love these digital preservation images!

like many in archives and digital curation, my work involves a lot of research, documentation and planning. i ❤ documentation.


Reference/Outreach – I have officially begun working on the reading room reference desk once a week. I learned some strategies for exploring a disk image to meet the needs of a reference request. You can learn more about this process on Kari’s blog. In April, I’m hosting two workshops on personal digital archiving. The workshop remixes this Personal Digital Archiving activity from the Society of Georgia Archivists. After the event, I’ll be sure to post about it and share materials. Back in January, I co-hosted a Wikipedia editing workshop with Greta Suiter. I’ve also been thinking about how to provide record creators with advice on creating and archiving digital content – so far I’ve gathered many external resources and created draft guides for PDF/A creation and use of Dropbox + MIT.

The Digital Sustainability Lab (DSLab) – I’ve been working closely with Kari Smith and Nancy McGovern on getting the lab ready for experiments. This has included lots of discussion about documentation and categorization of lab activities. One of the first things I’ve done is put together several sets of test files that can be used for experiments. I sourced these files from a test corpus already used in the archives, the Open Preservation Foundation test corpus, Creative Commons music websites, and the Personal Digital Archiving activity test files from the Society of Georgia Archivists. So far, I’ve documented the master set of each test corpus with a DROID profile, a narrative account of my process, and use guidance.
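To give a sense of what documenting a master set of test files can look like, here is a rough sketch of a simple file inventory with sizes and SHA-256 checksums. This is not DROID and not the lab’s actual procedure: DROID identifies formats by signature, while this stand-in only looks at file extensions. The function names `describe_corpus` and `count_by_extension` are hypothetical, and the folder passed in is whatever directory holds the test files.

```python
import hashlib
from collections import Counter
from pathlib import Path


def describe_corpus(root):
    """Return one inventory row per file found under root, sorted by path."""
    root = Path(root)
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rows.append(
            {
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                # Extension only; a tool like DROID inspects the bytes instead.
                "extension": path.suffix.lower() or "(none)",
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            }
        )
    return rows


def count_by_extension(rows):
    """Summarize how many files of each extension the corpus holds."""
    return Counter(row["extension"] for row in rows)
```

Calling `describe_corpus` on the master copy and again on a working copy, then comparing the `sha256` values, is one way to confirm that the files used in an experiment still match the documented originals.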

Conferences/Extracurricular Activities

Whew! It’s been a fun six months (well, seven at this point!). One of my goals for this summer is to write a few posts that provide more detail into these work areas. Stay tuned.