Posted: October 24, 2019

Workshop: Natural Language Processing (NLP) and Machine Learning for Digital Curation

Sangeeta Desai, Kam Woods, Cal Lee | BitCurator Consortium

This workshop will be an interactive session on using open-source natural language processing (NLP) and machine learning (ML) tools to process and provide access to born-digital materials. It will focus on applying topic modeling and named entity recognition to characterize and explore the contents of removable storage media (e.g., floppy disks, optical media), functionality developed through the BitCurator Access and BitCurator NLP projects. We will also explore open-source software (OSS) tools and methods that libraries, archives, and museums (LAMs) can use to identify email in born-digital collections, review email sources for sensitive or restricted materials, and perform appraisal and triage tasks to identify and annotate records. This part of the workshop draws on products of the Review, Appraisal and Triage of Mail (RATOM) project, which uses machine learning to separate records from non-records and natural language processing methods to identify entities of interest within those records. In addition to gaining hands-on experience with the tools, participants will learn about the rationale for their development, how they relate to other available software, and how NLP and ML can fit into larger digital curation workflows. We will conclude with a brief discussion of implications for participants' own institutions.

Cite this resource:
Sangeeta Desai, Kam Woods, Cal Lee. (October 24, 2019). Workshop: Natural Language Processing (NLP) and Machine Learning for Digital Curation. BitCurator Consortium.

