
Strategic uses of herbaria, specimens, and digital specimen data

A practical, confidence-based method for obtaining research-ready, machine-learning-derived trait data from herbarium specimens

Presenting Author
Patrick Sweeney
The digitization of natural history collections over the past three decades has unlocked a treasure trove of specimen occurrence data and imagery. However, manual extraction of trait data remains a bottleneck for many kinds of large-scale research questions. Deep learning methods have shown great promise for automating trait annotation from specimen imagery. Herbarium specimens are well suited to automated trait extraction from two-dimensional imagery, yet contemporary deep learning approaches have not performed well enough to be broadly trusted in practice. Using herbarium specimens as a test case, we propose a practical approach that increases the overall accuracy of any classifier by filtering out low-confidence annotations. We do this by applying an accuracy/coverage trade-off technique to existing classifiers. We show that a model that labels the phenology of herbarium specimens with only 82% accuracy can yield 98% accuracy on a subset of the data. We validate our method by successfully replicating the conclusions of a study built on expert-labeled samples. Our approach can help automatically generate research-grade labeled data, allowing researchers to answer biological questions with unprecedented affordability using off-the-shelf models.
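
The accuracy/coverage trade-off mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each prediction comes with a scalar confidence score (for example, a maximum softmax probability), and the function names `accuracy_coverage` and `pick_threshold`, as well as the toy data, are hypothetical. The idea is to keep only annotations whose confidence meets a threshold, then report how much of the data is retained (coverage) and how accurate the retained labels are.

```python
from typing import Optional, Sequence, Tuple


def accuracy_coverage(
    confidences: Sequence[float],
    predictions: Sequence[int],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[float, Optional[float]]:
    """Return (coverage, accuracy) after dropping predictions below `threshold`.

    Accuracy is None when no prediction is retained.
    """
    kept = [
        (p, y)
        for c, p, y in zip(confidences, predictions, labels)
        if c >= threshold
    ]
    coverage = len(kept) / len(labels) if len(labels) else 0.0
    if not kept:
        return coverage, None
    accuracy = sum(p == y for p, y in kept) / len(kept)
    return coverage, accuracy


def pick_threshold(
    confidences: Sequence[float],
    predictions: Sequence[int],
    labels: Sequence[int],
    target_accuracy: float,
) -> Optional[Tuple[float, float, float]]:
    """Find the lowest observed confidence whose retained subset reaches
    `target_accuracy` on a labeled validation set.

    Returns (threshold, coverage, accuracy), or None if no threshold works.
    """
    for t in sorted(set(confidences)):
        coverage, accuracy = accuracy_coverage(confidences, predictions, labels, t)
        if accuracy is not None and accuracy >= target_accuracy:
            return t, coverage, accuracy
    return None


# Toy validation data (invented for illustration only).
toy_conf = [0.99, 0.95, 0.90, 0.80, 0.60, 0.55]
toy_pred = [1, 0, 1, 1, 0, 1]
toy_true = [1, 0, 1, 0, 0, 0]

print(accuracy_coverage(toy_conf, toy_pred, toy_true, 0.0))
print(pick_threshold(toy_conf, toy_pred, toy_true, 0.95))
```

In this sketch the threshold is chosen on a labeled validation set and would then be applied to unlabeled specimens; raising the target accuracy generally lowers coverage, which is the trade-off a researcher has to weigh.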