Hespi: A pipeline for automatically detecting information from herbarium specimen sheets
This work addresses the need for efficient data extraction in biological, environmental, and conservation sciences. The contribution is incremental, since it builds on existing computer vision and OCR/HTR techniques.
The authors address the inefficiency of human-mediated transcription of biodiversity data from herbarium specimen sheets by developing Hespi, a pipeline that combines computer vision, OCR/HTR, and a multimodal LLM to detect and extract label text automatically. They report accurate results on specimen sheets from international herbaria.
Specimen-associated biodiversity data are crucial for biological, environmental, and conservation sciences. A rate shift is needed to extract data from specimen images efficiently, moving beyond human-mediated transcription. We developed 'Hespi' (HErbarium Specimen sheet PIpeline) using advanced computer vision techniques to extract pre-catalogue data from primary specimen labels on herbarium specimens. Hespi integrates two object detection models: one for detecting the components of the sheet and another for detecting the fields on the primary specimen label. It classifies labels as printed, typed, handwritten, or mixed and uses Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for extraction. The recognised text is then corrected against authoritative taxon databases and refined using a multimodal Large Language Model (LLM). Hespi accurately detects and extracts text from specimen sheets across international herbaria, and its modular design allows users to train and integrate custom models.
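
To make the correction step concrete, the following is a minimal sketch of how recognised text might be matched against a reference list of taxon names. It is not Hespi's implementation: the function name `correct_against_reference`, the small `KNOWN_GENERA` list, and the 0.8 similarity cutoff are all assumptions chosen for illustration, and the fuzzy matching uses Python's standard `difflib` rather than whatever method the pipeline itself applies.

```python
from difflib import get_close_matches

# Hypothetical stand-in for an authoritative taxon database.
# A real pipeline would load a much larger reference list.
KNOWN_GENERA = ["Eucalyptus", "Acacia", "Banksia", "Grevillea"]


def correct_against_reference(text, reference, cutoff=0.8):
    """Return the closest reference name, or the input unchanged.

    `cutoff` is an assumed similarity threshold between 0 and 1;
    matching is case-sensitive.
    """
    matches = get_close_matches(text, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else text


if __name__ == "__main__":
    # Simulated OCR/HTR outputs with typical recognition errors.
    for raw in ["Eucalyptis", "Bankzia", "Xyz"]:
        print(raw, "->", correct_against_reference(raw, KNOWN_GENERA))
```

Returning the input unchanged when nothing is close enough keeps the step conservative: a name missing from the reference list is passed through for later review rather than being replaced by a poor match.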