There are many workflow and policy decisions to be made in the acquisition, processing and preservation of digital archival content. When it comes to preservation, determining a strategy for file format preservation is very important. Here at IASC, we’ve recently implemented Archivematica and with this tool comes the need to make specific decisions about ingest and preservation actions for various file types in order to build our processing workflow.
This requires making sense of a large amount of existing digital archival content (a backlog, if you will). We want an easy way to see things like file format types across all collections, file format types by collection, and mismatched extensions. Once we can identify file formats easily, we can begin to work with Nancy McGovern (in the Libraries’ preservation unit) to determine workflow options for Archivematica and consider digital preservation strategy with a detailed understanding of the formats already in our collections.
So, how’d we do it? Well, Kari initially ran a DROID report on one of our storage areas and wanted to visualize the data. Excel was used to create a pie chart of PUIDs (the PRONOM identifiers DROID assigns to file formats). When I saw this, I thought that Tableau Desktop (a visualization tool I’m using for another project) could show the data in a more dynamic way. Using Open Refine, I cleaned up the DROID report a bit and parsed collection IDs from file paths into a separate column. From there, I used Tableau to create several different views of the data. The visualizations are interactive and allow a user to filter and hover over data points for further detail. The images below provide two examples.
This shows formats within a specific collection.
A look at last modified dates by year for a variety of Microsoft Office file types across all collections.
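For anyone who would rather script the cleanup step than do it in Open Refine, here is a minimal sketch in Python of the same idea: pull a collection ID out of each file path and count formats per collection. It is not what we actually ran. The column names `FILE_PATH` and `PUID` are assumed to match a DROID CSV export, the collection ID pattern (`MC` or `AC` followed by four digits) is purely hypothetical, and the sample rows are made up.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical collection ID pattern (e.g. "MC0123" as a folder name).
# Adjust this to whatever naming scheme your storage area really uses.
COLLECTION_ID = re.compile(r"[\\/]((?:MC|AC)\d{4})[\\/]")


def collection_id(file_path):
    """Return the first collection ID found in a path, or None."""
    match = COLLECTION_ID.search(file_path)
    return match.group(1) if match else None


def formats_by_collection(rows):
    """Count (collection ID, PUID) pairs across report rows."""
    counts = Counter()
    for row in rows:
        cid = collection_id(row["FILE_PATH"]) or "unknown"
        puid = row["PUID"] or "unidentified"
        counts[(cid, puid)] += 1
    return counts


# Made-up rows standing in for a DROID CSV export (assumed column names).
SAMPLE = """FILE_PATH,PUID
/storage/MC0123/reports/a.doc,fmt/40
/storage/MC0123/reports/b.doc,fmt/40
/storage/AC0456/photos/c.tif,fmt/353
/storage/misc/d.bin,
"""

counts = formats_by_collection(csv.DictReader(io.StringIO(SAMPLE)))
for (cid, puid), n in sorted(counts.items()):
    print(cid, puid, n)
```

To run it against a real report, swap the `io.StringIO(SAMPLE)` for an open file handle on the exported CSV. The resulting counts are the same shape of data that the "formats within a specific collection" view is built from.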
In addition to giving quick insight about our collections, the visualizations also raise a lot of questions regarding seemingly strange files or mismatched extension issues. One nice thing about Tableau is that the underlying data is always just a click away. We can go to the spreadsheet and take a closer look at specific files if needed.
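The "take a closer look at specific files" step can also be done outside the spreadsheet. The sketch below filters a report down to the rows flagged as having a mismatched extension. It assumes the report has an `EXTENSION_MISMATCH` column holding `true` or `false`, alongside `NAME`, `EXT` and `PUID`; check your own export, since those column names are an assumption here and the sample rows are invented.

```python
import csv
import io


def mismatched(rows):
    """Keep only rows whose EXTENSION_MISMATCH flag reads 'true'."""
    return [
        row for row in rows
        if row.get("EXTENSION_MISMATCH", "").strip().lower() == "true"
    ]


# Invented rows standing in for a DROID CSV export (assumed column names).
SAMPLE = """NAME,EXT,PUID,EXTENSION_MISMATCH
minutes.doc,doc,fmt/40,false
scan001.doc,doc,fmt/353,true
notes.txt,txt,x-fmt/111,FALSE
"""

flagged = mismatched(csv.DictReader(io.StringIO(SAMPLE)))
for row in flagged:
    print(row["NAME"], row["EXT"], row["PUID"])
```

The comparison is case-insensitive on purpose, since a spreadsheet round trip can change `true` to `TRUE`.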
Tableau has been pretty easy to learn so far. It’s all drag and drop: you arrange the underlying data into a variety of visualization options, and Tableau even suggests the best visualizations based on the dimensions and measures used. I still have a lot to learn about Tableau. My fellow Library Fellow, Christine, is organizing an MIT Libraries Tableau group. I hope the group and continued experimenting with Tableau will help IASC make the most of these visualizations. Next up might be some of our reference and reading room stats!
(Also – check out the U-M Bentley Historical Library post for more ideas, tools and techniques for identifying and characterizing sets of files. I’m hoping to try their methods out too.)