BitCurator: Beyond Environment

Guest post by Brian Dietz
This is a modified version of “BitCurator: Beyond Environment,” a talk I delivered remotely at the BitCurator Users Forum 2018, “Living on the Edge: Extending Digital Forensics into New Sectors.” The slides, with notes, are available through the community shared presentation and notes folders. I’d like to thank Sam Meister for offering me the opportunity to share the presentation as a Case Study.

Intro
In early 2018, the NCSU Libraries shifted our working environment for the ingest and processing of born digital archival materials. For the past several years, we had worked on a Windows computer running the BitCurator virtual environment (BCVM). Our current work is done in macOS, largely at the command line. Our motivation was to simplify both our processing procedures and the management of our computing environment. While we currently do not use the BitCurator VM, what we learned through our use of that environment and our involvement with the community kickstarted our work in digital archiving in a way that assembling tools on our own would not have.

Context
NCSU began “modernizing” our approach to born digital starting in 2013, and something similar to our current approach has been in production since 2015. Born digital processing is housed in the Special Collections Research Center, and it is co-managed by staff in the Digital Program, which manages the technical aspects of the work, and Technical Services, which advises on issues related to arrangement and description. The department has a born digital advisory group that includes representatives from management, researcher services, and curatorial services. We collaborate closely with the Libraries’ Digital Library Initiatives on the development of web applications. Processors are a combination of graduate student staff who work exclusively on born digital materials and other processing staff who work mostly on physical materials.

Our goal is to have a standard set of workflows that are accessible to folks who aren’t well-versed in digital archiving. To help with this, we developed a processing wizard. DAEV (Digital Assets of Enduring Value) guides a processor through a session, providing explicit instructions on actions to take; records processor actions and generates files containing preservation metadata about those actions; associates a processing session with an ArchivesSpace archival object record; and creates a digital object record for a session’s archival package and associates it with the appropriate archival object record. DAEV is built using open source technologies, and it is highly tailored to the decisions that we’ve made about workflows. DAEV is a web application that includes a Ruby on Rails backend, Angular.js frontend, a YAML file where workflows are defined, and Markdown files containing the instructional text for workflows.
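To make the YAML-plus-Markdown arrangement concrete, a workflow definition might look something like the following. This is a sketch only: the structure, field names, and commands are my assumptions for illustration, not DAEV’s actual schema.

```yaml
# Hypothetical workflow definition (illustrative; not DAEV's real schema)
floppy_35_image:
  label: "3.5\" floppy, disk imaging"
  variables:            # values the processor supplies during the session
    - disk_number
    - volume_name
    - image_extension
  steps:
    - id: describe_media
      instructions: describe_media.md   # Markdown file with processor-facing text
    - id: image_disk
      command: "ddrescue /dev/rdisk{{disk_number}} {{volume_name}}.{{image_extension}} {{volume_name}}.map"
    - id: run_reports
      command: "brunnhilde.py -d {{volume_name}}.{{image_extension}} reports"
```

The instructional text for each step lives in its own Markdown file, so archivists can revise processor-facing language without touching the workflow logic.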

In a processing session, processors begin by analyzing media and content characteristics. They then select a media type and workflow such as “image 3.5″ disk” or “tar files transferred electronically.”

Figure 1. Processors select media type and select workflow, 3.5” floppy

DAEV then guides processors through the session, which includes describing the media object in ArchivesSpace, stabilizing and retrieving content, and generating and reviewing various reports. Outside of ArchivesSpace, while processing media, most of our work is done at the command line (the exception being the KryoFlux GUI). DAEV provides step-by-step commands that processors copy and paste into a command prompt. This makes for simpler, more efficient processing workflows, while also increasing the comfort level of processors using a command line interface.

A consequence of having pre-defined workflows is the risk of inflexible procedures. One way we’ve attempted to mitigate this is through the use of variables. In a session, processors analyze media and content characteristics and provide DAEV with values to be used as input for variables in commands. Variables include the volume name and mount point of a disk (used in imaging or tarring); a filesystem code and sector offset (used by tsk_recover within brunnhilde); and the image file extension.
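As a sketch of how those processor-supplied values might slot into command templates, consider the following. All of the values, and the exact ddrescue and brunnhilde invocations, are assumptions for illustration rather than DAEV’s actual templates.

```shell
# Hypothetical values a processor might supply to DAEV during a session
DISK_NUM=2             # from diskutil list
VOLUME_NAME="FLOPPY01" # volume name of the disk
FS_CODE="fat12"        # filesystem code, passed through to tsk_recover
SECTOR_OFFSET=0        # partition offset, e.g. from mmls
IMG_EXT="img"          # image file extension

# DAEV would render templates like these for the processor to copy and paste:
IMAGE_CMD="ddrescue /dev/rdisk${DISK_NUM} ${VOLUME_NAME}.${IMG_EXT} ${VOLUME_NAME}.map"
REPORT_CMD="brunnhilde.py -d --tsk_fstype ${FS_CODE} --tsk_sector_offset ${SECTOR_OFFSET} ${VOLUME_NAME}.${IMG_EXT} reports"
echo "$IMAGE_CMD"
echo "$REPORT_CMD"
```

Because the variables are supplied per session, the same template serves disks with different filesystems and layouts.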

Figure 2. Processors provide variable values to DAEV to be used in later commands

This approach, where an application provides exact commands to processors, has worked well so far. We expect to run into cases where we’ll need more flexibility, in which case we may do additional preparatory work or adjust DAEV commands. (Most steps in DAEV provide a “notes” field for processors to record things like this.) Such cases may also give us an opportunity to adjust workflows to better balance standardization and flexibility. The imaging workflow is by far the most difficult and requires the most processor analysis and feedback; it is the workflow I anticipate needing the most adjustment in the future.

Shifting Environments
The shift to a mostly CLI environment isn’t the only one we’ve made recently. To better support our work, we also shifted our processing environment at the beginning of this year. Until January 2018 we ingested digital media using the BitCurator VM running on a Windows 7 machine. The BCVM bootstrapped our operations, and much of what I think we’ve accomplished over the last several years would not have been possible without this setup. We could have continued using it with our new workflows, but decided against it.

The desire to move to the CLI meant a need for a *nix environment. Cygwin for Windows is not a realistic option, and the Linux subsystem, available on Windows 10, had not been released. Linux also wasn’t an ideal option; while our Libraries’ IT supports Linux in certain use cases, there is better support for Windows and Mac. Personally, I no longer wanted to manage virtual machine distributions, and a dual-boot machine seemed too inefficient. Also, of the three major operating systems, I’m most familiar and comfortable with macOS, which is UNIX under the hood (and certified as such). Additionally, Homebrew, a package manager for Mac, made installing and updating the programs we needed, as well as their dependencies, relatively simple. Homebrew users can create a list of the packages on their systems, called a Brewfile, which can be used to install or reinstall packages as needed. The Brewfile can also be shared so others can install the same packages on their systems.
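A Brewfile covering the tools named in this post might look like the following. The formula names are assumptions on my part (worth confirming with `brew search` before relying on them), not a record of our actual file.

```ruby
# Hypothetical Brewfile; formula names assumed for illustration
brew "coreutils"    # provides gls
brew "sleuthkit"    # provides mmls and tsk_recover
brew "ddrescue"     # disk imaging
brew "cdparanoia"   # audio disc extraction
brew "clamav"       # virus scanning; includes freshclam
brew "exiftool"     # embedded metadata
brew "mediainfo"    # audiovisual metadata
```

Running `brew bundle --file=Brewfile` installs everything listed, and the same file can be kept in version control and shared across workstations.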

In addition to Homebrew, we use pip to update brunnhilde, and freshclam (included with ClamAV) to keep the virus database up to date. HFS Explorer, necessary for exploring Mac-formatted disks, is a manual install and update; it may be the main pain point (though not too painful yet). With the exception of HFS Explorer, updating is done at the time of processing, so the environment is always fresh.
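The start-of-session refresh described above could be wrapped in a small shell function, sketched below. The commands follow the tools named in this post; wrapping them in a function means nothing runs until a processor invokes it on the workstation.

```shell
# Sketch: refresh the processing environment at the start of a session.
# Nothing executes until update_environment is called.
update_environment() {
  brew update && brew upgrade       # Homebrew-managed tools and dependencies
  pip install --upgrade brunnhilde  # brunnhilde is updated via pip
  freshclam                         # refresh the ClamAV virus database
}
```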

Figure 3. Processors update their working environment at time of work

Our current workstation is a Mac Pro with:

3.7 GHz processor
32GB memory
1TB hard drive
5TB external drive

The main tools we use are:

Exploration
diskutil (to find disk number)
gls (to find volume names; the GNU version of ls shows escapes (“\”) in printouts)
hdiutil (for mounting images)
mmls (to find partition layout of disks)
drutil status (to show information about optical media)

Packaging
tar (for packing content from media not being imaged)
ddrescue (for disk imaging)
cdparanoia (for packaging content from audio discs)
KryoFlux GUI (for floppy imaging)

Reports
brunnhilde (for file and disk image profiling)
bulk_extractor (for PII scans)
clamav (for virus scans)

Metadata
exiftool
mediainfo
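Pulled together, a minimal imaging-and-reporting pass with these tools might look like the sketch below. The device path and filenames are hypothetical, and the sequence is wrapped in a function so nothing touches real hardware until it is run on a workstation with media attached.

```shell
# Sketch of one imaging session using the tools listed above.
# /dev/rdisk2 and the FLOPPY01 filenames are hypothetical examples.
image_session() {
  diskutil list                                  # identify the disk number
  mmls /dev/rdisk2                               # partition layout and sector offsets
  ddrescue /dev/rdisk2 FLOPPY01.img FLOPPY01.map # stabilize: create the disk image
  brunnhilde.py -d FLOPPY01.img reports          # profile the image and generate reports
}
```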

BitCurator and You/Me/Us

If we’re currently doing our processing work without BitCurator, how has our relationship with BitCurator changed? If you’re reading this, you’ve likely already decided that some forensic techniques are appropriate for working with digital archival materials. While we’re applying certain forensic interventions in our workflows, they exist on a continuum and vary across workflows. Some resource types get a few interventions, some get a lot. When we were using the BCVM, some interventions happened within the BitCurator environment and some, like disk imaging (FTK or KryoFlux), didn’t. And what’s likely is that we each may do our work slightly differently, making decisions based on media type, institutional preferences, risk tolerance, and so on. University Archives, for instance, are not enthralled with the idea of disk-level access. My institution does not currently image any media that’s been used “merely” as a file transfer mechanism.

And similar to how we each may have slightly different relationships to forensic applications, we may each intersect with BitCurator in slightly different ways. For some, “BitCurator” connotes a processing environment, while for others it’s a consortial relationship, a Google Group to get expert advice on tools and workflows, or the Users Forum. NCSU has in its queue of upcoming work the exploration of Access Webtools and the NLP project. But, for me, above all else, BitCurator is a community, and while my institution currently isn’t using the BitCurator VM, I like to believe that we’re maintaining several connections with the community.

Credits
I’d like to thank several of my colleagues for their contributions to the procedures and practices written about here, including Trevor Thornton, the developer and maintainer of DAEV; Jason Evans Groth, who contributed to DAEV’s inception and design; Cathy Dorin Black, Jessica Rayman, Erin Gallagher, Jessica Serrao, Taylor deKlerk, Meredith Campbell, and Hayley Wilson, who have contributed to various improvements of workflows and documentation through their insights gained while processing materials in testing and production environments; Linda Sellars, Head of Technical Services, who co-manages our born digital program; Eli Brown, Gwynn Thayer, and Todd Kosmerick for their guidance on issues related to curation, researcher services, and general management of digital archival assets; and Jason Ronallo for his ongoing support of and advocacy for archival digital collections.