CL | Dec 19, 2023

REE-HDSC: Recognizing Extracted Entities for the Historical Database Suriname Curacao

arXiv:2401.02972v2 | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses data quality issues faced by historians and archivists who use digitized historical records, but it is incremental: it builds on existing HTR and entity extraction techniques.

The project aimed to improve named entity extraction from handwritten text recognition (HTR) output, specifically for historical death certificates from Curacao. It found high precision for dates but low precision for person names, and it presents methods to improve name extraction.

We describe the project REE-HDSC and outline our efforts to improve the quality of named entities extracted automatically from texts generated by handwritten text recognition (HTR) software. We describe a six-step processing pipeline and test it by processing 19th- and 20th-century death certificates from the civil registry of Curacao. We find that the pipeline extracts dates with high precision but that the precision of person name extraction is low. Next, we show how the precision of name extraction can be improved by retraining HTR models with names, by post-processing, and by identifying and removing incorrect names.
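
To make the evaluation idea in the abstract concrete, the following is a minimal sketch of how extraction precision could be measured and how removing implausible names might affect it. It is not the REE-HDSC pipeline: the function names, the sample strings, and the use of a reference name list for filtering are all assumptions made for illustration.

```python
from typing import Iterable, List, Set


def precision(extracted: Iterable[str], gold: Set[str]) -> float:
    """Fraction of extracted items that also appear in the gold set."""
    items = list(extracted)
    if not items:
        return 0.0
    correct = sum(1 for item in items if item in gold)
    return correct / len(items)


def remove_unknown_names(names: Iterable[str], known_names: Set[str]) -> List[str]:
    """Keep only names found in a reference list (case-insensitive match).

    Filtering against a reference list is an assumed strategy here,
    not necessarily the one used in the project.
    """
    known = {name.lower() for name in known_names}
    return [name for name in names if name.lower() in known]


# Invented sample data: two real-looking names and two non-name fragments
# of the kind an extractor might wrongly label as person names.
gold_names = {"Maria Martina", "Johannes Cijntje"}
extracted_names = ["Maria Martina", "Johannes Cijntje", "overleden is", "den Heer"]
known_names = {"Maria Martina", "Johannes Cijntje", "Pieter Maduro"}

precision_before = precision(extracted_names, gold_names)
filtered_names = remove_unknown_names(extracted_names, known_names)
precision_after = precision(filtered_names, gold_names)

print(f"precision before filtering: {precision_before:.2f}")
print(f"precision after filtering:  {precision_after:.2f}")
```

Filtering of this kind can raise precision at the cost of recall, since correct names missing from the reference list are discarded as well.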

Code Implementations: 1 repo
