Session 1: Scalability and Automation

David Cirella, Grete Graf, Lynn Moulton, Joanna White | BitCurator Consortium

Talk 1: Scalability, automation and open source tools at the British Film Institute (00:01:53)

Joanna White

The British Film Institute’s (BFI) National Archive recently started a preservation project to convert 3PB of DPX film scans into FFV1 Matroska video files using automation scripts written by staff not formally trained as developers. This involves the use of RAWcooked, a lossless compression tool produced by MediaArea, a consortium of developers and archivists producing tools with preservation interests at their core. Using RAWcooked, the BFI has seen significant reductions in data size, which in turn have reduced network impact, made it possible to review film scans via the Archive’s media asset management platform, and will provide a long-term benefit to the BFI by reducing preservation storage costs. These benefits complement the primary driver for the project: digital preservation using a lossless, open, standards-based format that is increasingly adopted by public archives. This talk will draw on a recent blog post which shares in full this project’s bash scripts, tools and approaches. Our mass digitisation project continues to reveal complications associated with scaling up the automation of such diverse collections, and our scripts continue to respond to these changes. The open source preservation community actively develop scripts and software together, freely sharing knowledge to improve the skills base of the sector. This methodology works against some historical tendencies for institutions to rely on broadcast solutions or vendor-developed closed technologies. This talk reflects on this open source ethos, and the complications and benefits encountered by collections seeking stable futures by divesting from the built-in obsolescence of commercial alternatives.

Talk 2: One Byte at a Time: Small Steps for Dismantling Technological Knowledge Barriers (00:10:41)

Lynn Moulton

This talk was inspired by a discussion about imposter syndrome that occurred at last year’s BitCurator User Forum. Why do so many GLAM practitioners feel this way about digital forensics? How can we conquer it if we have limited time and resources? We decided to grab a buddy and jump off the deep end. This talk will share what happened when a preservation librarian and a processing archivist tackled using freely available open-source tools from home to incrementally build the technical skills necessary for born-digital archiving. We’ll discuss the obstacles we encountered and the lessons we learned along the way.

Talk 3: Millions and Millions: Scaling digital preservation at the brink of 100 million (00:18:52)

David Cirella and Grete Graf

Automation is a necessity for working with collections of digital objects that number in the tens of millions. The scale of such collections makes it infeasible to manually package, ingest, and verify stored objects. While the specific steps of packaging and ingest vary across preservation systems, the process involves rearranging files into intellectually consistent objects. It also involves planning around the normative structure of the preserved content and mechanisms to confirm that objects have been successfully ingested. This presentation presents two case studies of automating the packaging and ingest process of electronic resources based on readily available tools on Windows and Linux/Mac platforms. One approach relies primarily on scripting through Python and the Bash shell, while the other utilizes Excel and PowerShell. Reporting options available through APIs and PostgreSQL to verify ingests will also be highlighted. The presentation will also address lessons learned throughout the process and planned next steps. Implementing digital preservation at scale requires flexibility and a willingness to explore different methods and tools.
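
As a rough illustration of the packaging and verification steps described above, the following minimal Python sketch groups files into objects and checks them against a checksum manifest. It is not the presenters' code. The file naming rule (an object identifier before the first underscore) and the function names `group_into_objects`, `build_manifest` and `verify_manifest` are assumptions made for this example, and a real workflow would compare against checksums reported by the preservation system rather than re-reading local files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_into_objects(paths):
    """Group files into objects by the text before the first underscore.

    The naming rule is hypothetical; real collections need their own rule.
    """
    objects = defaultdict(list)
    for p in paths:
        object_id = Path(p).name.split("_", 1)[0]
        objects[object_id].append(p)
    return {object_id: sorted(files) for object_id, files in objects.items()}


def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each file path (as a string) to its checksum before ingest."""
    return {str(p): sha256_of(p) for p in paths}


def verify_manifest(manifest):
    """Return the paths that are missing or whose checksum has changed."""
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            failures.append(path)
    return failures
```

An empty list from `verify_manifest` means every file recorded in the manifest is still present and unchanged. At a scale of tens of millions of objects, a check like this would normally be batched and its results logged rather than held in memory.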

Chat Log | Slides
BitCurator Consortium Presentations

Cite this resource:
David Cirella, Grete Graf, Lynn Moulton, Joanna White. (October 13, 2020). Session 1: Scalability and Automation. BitCurator Consortium.