fellowship update: viz for strategy

There are many workflow and policy decisions to be made in the acquisition, processing and preservation of digital archival content. When it comes to preservation, determining a file format strategy is especially important. Here at IASC, we’ve recently implemented Archivematica, and with this tool comes the need to make specific decisions about ingest and preservation actions for various file types in order to build our processing workflow.

This requires making sense of a large amount of existing digital archival content (a backlog, if you will). We want an easy view of things like: file format types across all collections, file format types by collection, and mismatched extensions. By making file formats easy to identify, we can begin to work with Nancy McGovern (in the Libraries’ preservation unit) to determine workflow options for Archivematica and consider digital preservation strategy with a detailed understanding of the formats already in our collections.

So, how’d we do it? Well, Kari initially ran a DROID report of one of our storage areas and wanted to visualize the data. Excel was used to create a pie chart of PUIDs. When I saw this, I thought that Tableau Desktop (a visualization software I’m using for another project) could show the data in a more dynamic way. Using OpenRefine, I cleaned up the DROID report a bit and parsed collection IDs from file paths into a separate column. From there, I used Tableau to create several different views of the data. The visualizations are interactive and allow a user to filter and hover over data points for further detail. The images below provide two examples.
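(For the curious: the OpenRefine step — pulling a collection ID out of each file path into its own column — could also be scripted. Here’s a minimal Python sketch; the `/storage/<collection-id>/…` path layout and the `FILE_PATH` column name are assumptions for illustration, not our actual report structure.)

```python
import csv
from pathlib import PurePosixPath

def collection_id(file_path):
    """Guess the collection ID from a path, assuming paths look
    like /storage/<collection-id>/rest/of/path (an assumption)."""
    parts = PurePosixPath(file_path).parts
    return parts[2] if len(parts) > 2 else ""

def add_collection_column(in_csv, out_csv, path_col="FILE_PATH"):
    """Copy a DROID-style CSV report, appending a COLLECTION_ID column."""
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["COLLECTION_ID"])
        writer.writeheader()
        for row in reader:
            row["COLLECTION_ID"] = collection_id(row[path_col])
            writer.writerow(row)
```

The resulting CSV loads straight into Tableau with collection as a ready-made dimension.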


This shows formats within a specific collection.


A look at last modified dates by year for a variety of Microsoft office file types across all collections.

In addition to giving quick insight about our collections, the visualizations also raise a lot of questions regarding seemingly strange files or mismatched extension issues. One nice thing about Tableau is that the underlying data is always just a click away. We can go to the spreadsheet and take a closer look at specific files if needed.
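(Those mismatched-extension oddities can also be pulled straight out of the report with a few lines of Python — a sketch, assuming the report export includes an `EXT_MISMATCH` column, which is an assumption about the export rather than a guarantee.)

```python
import csv

def mismatched_rows(droid_csv, flag_col="EXT_MISMATCH"):
    """Return the report rows flagged as having a file extension that
    doesn't match the identified format (column name assumed)."""
    with open(droid_csv, newline="") as f:
        return [row for row in csv.DictReader(f)
                if row.get(flag_col, "").strip().lower() == "true"]
```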

Tableau has been pretty easy to learn so far. It’s all drag-and-drop: you arrange the underlying data into a variety of visualization options, and Tableau even suggests the best visualizations based on the dimensions and measures used. I still have a lot to learn about Tableau. My fellow Library Fellow, Christine, is organizing an MIT Libraries Tableau group. I hope the group and continued experimenting will help IASC make the most of these visualizations. Next up might be some of our reference and reading room stats!

(Also – check out the U-M Bentley Historical Library post for more ideas, tools and techniques for identifying and characterizing sets of files. I’m hoping to try their methods out too.)


fellow update: code4lib 2015

Okay, so, February has not been a good month for blog updates or reading. But I did have the opportunity to attend Code4Lib 2015 in Portland! The atmosphere was reflective (this was the 10th year of C4L) and really fun. While conferences with different tracks and loads of presentations can be great, I really appreciated the single track style of C4L. I liked hearing complete talks (no fear of missing out!) and having a shared experience with other attendees.

I’ve been working on summarizing my experience since I got back and I’ve decided that, for me, the talks fit into two broad categories: teaching/learning/culture and ideas/projects/tools. 

In this post I’ve listed some of the highlights from each category. There were many other great talks and projects – the Code4Lib wiki has slides available for most of the presentations and lightning talks. Or check out the video of some of the talks.


reading notes: podcast edition

Now that I’ve officially left 2008 behind and upgraded to a smartphone (and apparently contributed to bonkers Apple profits), I’ve been on the hunt for archive- and library-related podcasts. An obvious first choice is the More Podcast, Less Process podcast from the Keeping Collections project (a METRO project). So far there are ten episodes, and I elected to start listening with episode seven because I’m a rebel!

The episode from early 2014, Humans.txt.mp3-The Web Archivists are Present, is focused on web archiving pursuits and challenges. All the usual suspects are discussed: staffing needs and skills, difficulty of crawling dynamic sites, challenges with getting full captures, deciding what to capture and how to scope crawls, topical vs. institutional web collections, facilitating searching and access, permissions and robots.txt decisions.

Even better, the discussion places these topics in two specific institutional contexts – Columbia University Library web archives (established program) and the New York Art Resources Consortium (new program). Both institutions are Archive-It partners, so web collecting is discussed within the Archive-It model. At the very end, the group raised a couple of questions I found particularly worthy of consideration: Are website collections really archival collections? Can web archives ever be organic collections, or are they always artificial collections created by the web archivist? …Which, for me, raises the question: how much does it really matter either way? Another possible way to approach this discussion might be considering the merits, similarities and differences between web archiving as collection development and web archiving as records management.

I don’t have much else in the way of critique or further discussion, but I always enjoy learning about how other professionals are developing web archive programs (because it’s no simple task!). Go listen! Other web archiving things I’ve viewed, read or scanned recently:

Do you know of any other archive-library related podcasts? This is what I’ve found so far:

Today’s coffee: Starbucks Veranda

The reading notes posts found on this blog are intentionally question-filled and casual. Each notes post serves as a sort of open journal record of my professional development reading as the MIT Libraries Fellow for Digital Archives. See the introduction post for more on this series. I welcome suggestions for future readings—current or archival!

CurateGear 2015

My first professional development activity of 2015 took place last week in lovely (and unseasonably cold) North Carolina. CurateGear is an event hosted by the SILS program at UNC-Chapel Hill that provides practitioners a platform to both share and learn about the application of digital curation tools. Video of all of the lightning talks can be found on the CurateGear 2015 website.

The following are a few of the things I found most interesting:

Digital Curation Education

  • I found Carolyn Hank’s talk about her research on digital curation education interesting. As a recent graduate, I think there is definitely a need for a clearer understanding of how to best prepare students for the workplace. This topic came around again in the closing panel. The panelists highlighted the notion that students don’t necessarily need to learn specific tools-of-the-month in the classroom, but instead need to learn to think strategically about addressing the concepts that make up digital curation (e.g. ingest in practice, AIP concept modeling, workflows, restricted access methods vs. open access methods, digital forensics use cases). This is all in my own words, and the panelists phrased it better — I just remember nodding my head a lot and forgetting to take good notes!

Using Data to Plan for Born-Digital Processing

  • Erika Farr talked about her team’s work to measure effort spent in born-digital processing. She explained how her team created effort categories, used Redbooth to track tasks and time, and what they hope to gain from the resulting data. I think this kind of measuring and planning will only become more commonplace as more and more archives are able to ramp up born-digital processing.

Emulation and Access for Digital Archives

  • Susan Malsbury from NYPL talked about their efforts to provide access to a collection of born-digital records from the 1980s/1990s. Currently, the records are available at a reading room station only. Users can view the collection by using Quick View Plus and the DOSBox emulator.

Analyzing Archive-It Collections

  • Lori Donovan, from the Internet Archive, demoed the newest updates to Archive-It. Archive-It 5.0 provides partners with a newly redesigned and expanded reports feature. It is now possible to explore captured data with useful (and pretty!) visualizations that show the size of captures, use of the data budget, amount of duplication, file types, and more. The reports feature is only one part of the Archive-It 5.0 redesign – I can’t wait to see what else they’ve updated.

I also want to give a shout out to the talks by my colleagues Nancy McGovern and Kari Smith — their talks focused on the high-level planning side of things, but planning is essential to curation “gear” selection! The work they discussed relates to so many things I’m working on. I’m sure a future post about my work will reference the management tools for curation and the digital archives eco-system visualization.

Though I was itching for a hands-on session to play with all the tools discussed, I enjoyed CurateGear 2015 and the opportunity to see a variety of tools in action.

reading notes: visualizing robotics history

Milojević, Staša, and Selma Šabanović. 2013. “Conceptual Foundations for Representing Robotics History in a Non‐linear Digital Archive.” Library Hi Tech 31 (2). Emerald Group Publishing Limited: 341–54. doi:10.1108/07378831311329095.

“Current online oral history archives are often forced into flat linear structure. … We want to take advantage of full capabilities of current technology to allow for non-linear presentations of narratives and data that do not conform to rigid timelines nor are forced into presenting a single aspect of the phenomenon.” p. 351

The project that this article describes aims to capture oral history accounts of the development of robotics and then use the resulting data alongside bibliometric data to create visualizations that position the history of robotics within a “knowledge ecology.” Thinking of the field of robotics — or any field, really — as a knowledge ecology allows one to consider the “interrelationships within and between the institutional, social, cognitive, historical, and material factors” that affected the development of a discipline. This moves the emphasis from a strictly linear timeline (based on publications alone) to a more context based, non-linear exploration (p.343).

The resulting collection, in the case of this project, allows a user to learn about the “local and personal understandings of robotics” as well as the “broader systemic picture” (p. 343). That is, the non-linear oral history accounts are placed within the context of the more linear timeline derived from bibliometric data (publications, patents, conferences).

fellow update: what is it… you do here?

What does an MIT Libraries Fellow for Digital Archives do?

This question and similar derivative questions (Digital preservation is what? Digital curation means what?) have been consistent companions since I started graduate school in 2012. Explaining the details of digital curation and preservation is a topic big enough for its own post. I’m not going to venture there today, but I would like to highlight what a Digital Archives Fellowship is all about. My quick answer is that the fellowship experience provides built-in mentorship and an opportunity to continue developing skills that help me wrangle digital content and context. As far as what I do here …


Office Space gif from http://gph.is/1rZtor0

My official project plan is still in development. As my projects become more defined, I’ll provide updates about my work on this blog. But, essentially, I’m here to help the Digital Archivist in furthering the development of the digital archives program. This means we are working on things like:

  • Workflow analysis and documentation related to transferring, processing and managing digital collections
  • Testing and development of use cases for various digital archives tools that assist or automate processing and management activities (e.g. Archivematica, ArchivesSpace, BitCurator)
  • Enabling access to digital collections in the reading room and online

As a fellow, I also have the luxury of extra time for skill building and professional development. So far, I’ve set aside time each week for this blog as well as for developing programming skills. First up is Python. I had hopes of taking an edX course in data analysis using Python, but my hopes were dashed quickly (by the second week). I learned the basics of Python in graduate school, but those seven weeks felt like they took place seven years ago! My retention was pretty bad, so I’m back to basics. Through Codecademy and a textbook from graduate school, I’m relearning Python. I’ll provide an update on my progress once I have a better idea of how I might use Python—maybe to analyze digital collections or automate part of a workflow. For now, here is an example of my brilliant Python abilities.

Image of very simple Python code
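(In that beginner spirit — and hinting at the collection-analysis direction mentioned above — here’s a small hypothetical sketch, not the code pictured, that tallies file extensions under a folder.)

```python
from collections import Counter
from pathlib import Path

def count_extensions(folder):
    """Tally file extensions under a folder -- the kind of small
    collection-analysis task Python is handy for."""
    return Counter(p.suffix.lower() or "(none)"
                   for p in Path(folder).rglob("*") if p.is_file())

# Example:
# for ext, n in count_extensions("/path/to/collection").most_common():
#     print(ext, n)
```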