web archiving resources for NDSA NE crew (and anyone else reading this!)

This list of resources is shared as a complement to a presentation I gave at the NDSA New England meeting on September 25, 2015. The presentation discussed the MIT Institute Archives’ efforts to acquire websites without a hosted service. I talked about how technology is important, but policy development and planning are key activities that can be accomplished even if new technology isn’t possible right away. The presentation also highlighted the tools we’re finding useful that are easy for an archivist with limited programming skills to use (Webrecorder, wget and Web Archive Player). I’ve previously talked about some of these activities on ArchiveHour; see that post here.

P.S. Every time I think I’ve got a handle on the essential web archiving resources, I find out about something new. I also realize that a lot of work went into web archiving development long before I first learned about it in 2013. With this in mind, it’s quite possible that a lot of good stuff is missing from the following list. Please add resources you love in the comments or alert me to my ignorance via the contact page. =) Thank you!

Get Started

  • International Internet Preservation Consortium (IIPC) website – What is web archiving?
  • IIPC blog post (2015), Ian Milligan – “So You Want to Get Started In Web Archiving?” Provides an excellent list of blogs to follow.
  • Archive-It Web Archiving Life Cycle – the examples are specific to the Archive-It service and its partners, but the life cycle breakdown and concepts are helpful for thinking about the range of activities and policy that go into a web archiving program.
  • DPC Technology Watch 13-01 (2013), Maureen Pennock – “Web-Archiving”
  • NDSA 2013 Web Archiving in the United States survey report



fellowship update: tool time

One of my goals for the fellowship is to become more familiar with the tools useful to digital archives and to better understand how they apply in practice. I try to set aside time each week for testing and learning. Some of the testing I’ve done relates to my work with the Digital Sustainability Lab. Below are a few tools I’ve worked with recently and some others on my “up next” list.

image of gardening tools from 1920s magazine ad

Digital archives toolkits aren’t all that different from gardening toolkits. Gathering, planting and weeding, watering, harvesting… (image: flickr user biodiversity heritage library)

Recently Explored and Tested:

Archivematica (1.4) – Archivematica has changed a bit since I last used it in 2013! I’ve been learning more about the storage service options and trying out the new arrangement feature.

ePADD – Email archiving, processing and access from Stanford Libraries. Once we’ve had a chance to work with ePADD more, I’m sure I’ll do some posts here and on Engineering Future of the Past. This tool has me dreaming about a future where all digital archives appraisal and processing incorporates natural language processing and data visualization.

wget (with WARC) – Configuring this tool nearly defeated me. But I got it working on a Mac and have successfully crawled a website with WARC file output! This tool was part of a series of tests for the DS Lab, so I’ll definitely post a more detailed account soon.
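
For anyone curious what a wget crawl with WARC output looks like in general, here is a minimal sketch in Python that only assembles the command line. It is not the configuration from the DS Lab tests. The function name, the example URL and the output name are placeholders made up for illustration, and the sketch assumes a wget build with WARC support (version 1.14 or later).

```python
import shlex


def build_wget_warc_command(url, warc_name):
    """Return the argument list for a wget crawl that also writes a WARC file.

    Both parameters are illustrative placeholders, not values from the post.
    """
    return [
        "wget",
        "--mirror",                  # recursive crawl with timestamping
        "--page-requisites",         # also fetch images, CSS and scripts
        "--wait=1",                  # pause one second between requests
        f"--warc-file={warc_name}",  # base name for the WARC output
        "--warc-cdx",                # also write a CDX index file
        url,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it in a terminal.
    command = build_wget_warc_command("https://example.org/", "example-crawl")
    print(shlex.join(command))
```

Printing the command first, instead of running it straight away, makes it easy to check the options before starting a crawl. The wait value is an arbitrary choice here and should be adjusted to suit the site being crawled.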

Tableau – This visualization software is something my fellow Fellow, Christine, and I are working with for our joint project. Once we’re done, I’ll probably share about our project and the work Christine did to get our data visualizin’ with Tableau.

Webrecorder – web archiving for all! I talked about Webrecorder last month in this post.

Up next:

BitCurator – MIT Libraries is a member of the BitCurator consortium and I’ve used BitCurator a bit in the past. I think it’s high time I increase my familiarity with what’s included in BitCurator and how it might fit into different processing situations.

Lunchbox from NPR – This isn’t a tool that’s really specific to digital archives workflows, but the Waterbug tool could be really useful for prepping images for sharing on social media.

OpenRefine – messy data is likely here to stay and I want to know more about how to clean up data efficiently.

MDQC and BWF MetaEdit – AVPreserve tools for checking and adding metadata.

CurateGear 2015

My first professional development activity of 2015 took place last week in lovely (and unseasonably cold) North Carolina. CurateGear is an event hosted by the SILS program at UNC-Chapel Hill that provides practitioners a platform to both share and learn about the application of digital curation tools. Video of all of the lightning talks can be found on the CurateGear 2015 website.

The following are a few of the things I found most interesting:

Digital Curation Education

  • I found Carolyn Hank’s talk about her research on digital curation education interesting. As a recent graduate, I think there is definitely a need for a clearer understanding of how to best prepare students for the workplace. This topic came around again in the closing panel. The panelists highlighted the notion that students don’t necessarily need to learn specific tools-of-the-month in the classroom, but instead need to learn to think strategically about addressing the concepts that make up digital curation (e.g. ingest in practice, AIP concept modeling, workflows, restricted access methods vs. open access methods, digital forensics use cases).  This is all in my own words, and the panelists phrased it better — I just remember nodding my head a lot and forgetting to take good notes!

Using Data to Plan for Born-Digital Processing

  • Erika Farr talked about her team’s work to measure effort spent in born-digital processing. She explained how her team created effort categories and used Redbooth to track tasks and time, and what they hope to gain from the resulting data. I think this kind of measuring and planning will only become more commonplace as more and more archives are able to ramp up born-digital processing.

Emulation and Access for Digital Archives

  • Susan Malsbury from NYPL talked about their efforts to provide access to a collection of born-digital records from the 1980s/1990s. Currently, the records are available at a reading room station only. Users can view the collection by using Quick View Plus and the DOSBox emulator.

Analyzing Archive-It Collections

  • Lori Donovan, from the Internet Archive, demoed the newest updates to Archive-It. Archive-It 5.0 provides partners with a newly redesigned and expanded reports feature. It is now possible to explore captured data with useful (and pretty!) visualizations that show size of captures, use of data budget, amount of duplication, file types, and more. The reports feature is only one part of the Archive-It 5.0 redesign – I can’t wait to see what else they’ve updated.

I also want to give a shout-out to the talks by my colleagues Nancy McGovern and Kari Smith. Their talks focused on the high-level planning side of things, but planning is essential to curation “gear” selection! The work they discuss is related to so many things I’m working on. I’m sure a future post about my work will reference the management tools for curation and the digital archives ecosystem visualization.

Though I was itching for a hands-on session to play with all the tools discussed, I enjoyed CurateGear 2015 and the opportunity to see a variety of tools in action.