Session 7
Karl Blumenthal and Sarah Beth Seymore, Internet Archive | BitCurator Consortium
Enhancing Use of Born-Digital Collections Using ARCH
A limited understanding of how cultural heritage practitioners can use artificial intelligence and machine learning (AI/ML) to enhance access to born-digital collections remains a major barrier to exploratory investigation of AI/ML tools and methods. Yet computational methods such as text mining and data visualization show promising pathways for scholarly use of born-digital archives.
In this demonstration, Internet Archive’s Community Programs team will show participants how ARCH (Archives Research Compute Hub), an open-source platform for building, analyzing, and generating datasets, can increase access to web archives and other digital collections through computational analysis. Sharing AI/ML skill sets and methods will better prepare cultural heritage practitioners across a range of organizational backgrounds to steward born-digital collections in a way that promotes computational engagement, scholarship, and research. Using a dataset of plain text webpages, we’ll show participants how to use ARCH to explore data in the command line with Jupyter Notebooks and mine text using Voyant.
By the end of the demonstration, participants will understand how ARCH can be used to build custom research collections relevant to a wide range of subjects; generate, access, and analyze research-ready datasets from collections; and publish and preserve these datasets. ARCH is a paid Internet Archive service that leverages IA’s nonprofit-owned compute and storage infrastructure to support working computationally with collections as data at scale. The underlying ARCH code is available as open source at https://github.com/internetarchive/arch. Forum attendees will have access to ARCH during and after the Forum.
Links
- Data & slides (downloadable ZIP file): bit.ly/bcc-arch
- ARCH (Archives Research Compute Hub): https://webservices.archive.org/pages/arch
- CARTA (Collaborative ART Archive): https://carta.archive-it.org/
- Digital scholarship and the web: Exploring new sources and emerging research methods: https://archive-it.org/post/digital-scholarship-and-the-web/
- WARC specification repository (IIPC): https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
- Introduction to the WARC file (Archive-It): https://archive-it.org/post/the-stack-warc-file/
- ARCH GitHub repository: https://github.com/internetarchive/arch
- Voyant: https://voyant-tools.org/
- Latest tutorials (and videos!): https://arch-webservices.zendesk.com/hc/en-us/articles/15772545086612-Try-it-yourself-Sample-ARCH-datasets-and-how-to-explore-them
Karl Blumenthal and Sarah Beth Seymore, Internet Archive. (March 21, 2024). Session 7. BitCurator Consortium.