Oral Paper

Biodiversity Informatics & Herbarium Digitization

Humans in the Loop: Citizen science and machine learning synergies for overcoming herbarium digitization bottlenecks

Presenting Author
Robert Guralnick
Description
The slowest step in digitizing natural history collections is converting imaged labels into digital text. We present a working solution to this long-recognized efficiency bottleneck that leverages synergies between citizen science and machine learning, in the form of two new semi-automated services. The first detects labels on herbarium sheets and classifies them as typewritten, handwritten, or mixed. The second runs OCR on label text using a workflow tuned for specimen labels. The label finder and classifier was built through a humans-in-the-loop process in which the citizen science platform Notes from Nature (NFN) is used to develop training and validation datasets that feed a machine learning pipeline. Our results show >93% success in finding and classifying main labels. The OCR pipeline optimizes pre-processing, the use of multiple OCR engines, and post-processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields a >4-fold reduction in errors compared to off-the-shelf open source solutions. The OCR workflow also includes a human validation step that uses a custom NFN tool. Together, these tools form a usable toolkit for herbarium digitization, including a custom-built web application accessible to all. Further work to integrate these services into existing toolkits can support the broadest community use.
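
To illustrate how sequence alignment from molecular systematics can reconcile the outputs of several OCR engines, here is a minimal sketch. It is not the authors' pipeline: the abstract does not name the alignment algorithm, so the use of Needleman-Wunsch global alignment, the scoring values, and the function names `align` and `consensus` are all assumptions made for illustration. The sketch aligns each OCR output to the first one and takes a per-character majority vote. As a simplification, characters that other engines insert relative to the first output are ignored.

```python
from collections import Counter

GAP = "\x00"  # sentinel gap symbol, chosen so it cannot clash with label text


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two strings.

    Returns two equal-length lists of characters, with GAP marking indels.
    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


def consensus(outputs):
    """Majority-vote consensus of OCR strings, anchored on the first output.

    Simplification (assumption): insertions relative to the first output
    are dropped rather than merged into a full multiple alignment.
    """
    ref = outputs[0]
    votes = [Counter({ch: 1}) for ch in ref]  # the reference votes for itself
    for other in outputs[1:]:
        gapped_ref, gapped_other = align(ref, other)
        pos = 0
        for ref_ch, other_ch in zip(gapped_ref, gapped_other):
            if ref_ch == GAP:
                continue  # insertion relative to the reference: ignored here
            votes[pos][other_ch] += 1  # other_ch may be GAP (a deletion vote)
            pos += 1
    winners = [v.most_common(1)[0][0] for v in votes]
    return "".join(ch for ch in winners if ch != GAP)


# Hypothetical outputs from three OCR engines for the same label line.
ocr_outputs = ["Quercus a1ba L.", "Quercus alba L.", "Qvercus alba L."]
print(consensus(ocr_outputs))
```

In this toy case each engine makes a different single-character error, so the vote recovers the correct line. A production workflow would need a true multiple sequence alignment and a policy for ties, which this sketch resolves in favor of the first output.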