Workshop: Natural Language Processing (NLP) and Machine Learning for Digital Curation

Sangeeta Desai, Kam Woods, Cal Lee | BitCurator Consortium

This workshop will be an interactive session on using open-source natural language processing (NLP) and machine learning (ML) tools to process and provide access to born-digital materials. It will focus on applying topic modeling and named entity recognition to characterize and explore the contents of removable storage media (e.g., floppy disks, optical media), functionality developed through the BitCurator Access and BitCurator NLP projects. We will also explore open-source software (OSS) tools and methods that libraries, archives and museums (LAMs) can use to identify email in born-digital collections, review email sources for sensitive or restricted materials, and perform appraisal and triage tasks to identify and annotate records. This part of the session will draw specifically on products of the Review, Appraisal and Triage of Mail (RATOM) project, which uses machine learning to separate records from non-records and natural language processing methods to identify entities of interest within those records. In addition to gaining hands-on experience with the tools, participants will learn about the rationale for their development, how they relate to other available software, and how NLP and ML can fit into larger digital curation workflows. We will conclude with a brief discussion of implications for participants in their own institutions.

BitCurator Consortium Presentations
Cite this resource:
Sangeeta Desai, Kam Woods, Cal Lee. (October 24, 2019). Workshop: Natural Language Processing (NLP) and Machine Learning for Digital Curation. BitCurator Consortium.