CL · IR · LG · Jan 1

Noise-Aware Named Entity Recognition for Historical VET Documents

arXiv:2601.00488v1
h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses NER for historical VET documents, a domain-specific problem, and makes incremental improvements in handling noise.

This paper tackles Named Entity Recognition in historical Vocational Education and Training documents affected by OCR noise. It proposes a robust approach that combines Noise-Aware Training, transfer learning, and multi-stage fine-tuning, and it reports substantial increases in robustness and accuracy under noisy conditions.

This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
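To make the idea of synthetically injected OCR errors concrete, here is a minimal Python sketch of how clean, labelled tokens could be perturbed for Noise-Aware Training. It is an illustration under stated assumptions, not the authors' implementation: the confusion table `OCR_CONFUSIONS`, the functions `noisy_token` and `inject_ocr_noise`, the `rate` and `seed` parameters, and the label names `B-OCC` and `B-LOC` are all hypothetical. Noise is applied per token so that token-level labels stay aligned one-to-one.

```python
import random

# Hypothetical OCR confusion table (an assumption, not taken from the paper).
# Keys are clean characters; values are plausible misrecognitions.
OCR_CONFUSIONS = {
    "e": ["c"],
    "l": ["1", "I"],
    "m": ["rn"],
    "n": ["u"],
    "o": ["0"],
    "s": ["5"],
    "ä": ["a"],
    "ü": ["u", "ii"],
    "ß": ["B"],
}


def noisy_token(token, rate, rng):
    """Perturb each character of `token` with probability `rate`."""
    out = []
    for ch in token:
        if rng.random() >= rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(("substitute", "delete", "duplicate"))
        if op == "substitute":
            # Characters without a table entry are left as they are.
            out.append(rng.choice(OCR_CONFUSIONS.get(ch, [ch])))
        elif op == "duplicate":
            out.append(ch + ch)
        # "delete": append nothing
    # Never return an empty token, so labels stay aligned with tokens.
    return "".join(out) or token


def inject_ocr_noise(tokens, labels, rate=0.1, seed=0):
    """Return a noisy copy of `tokens` together with the unchanged labels."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    rng = random.Random(seed)  # seeded for reproducible augmentation
    return [noisy_token(t, rate, rng) for t in tokens], list(labels)


if __name__ == "__main__":
    tokens = ["Ausbildung", "zum", "Müller", "in", "Berlin"]
    labels = ["O", "O", "B-OCC", "O", "B-LOC"]  # hypothetical tag set
    noisy, kept = inject_ocr_noise(tokens, labels, rate=0.2, seed=42)
    for clean, dirty, tag in zip(tokens, noisy, kept):
        print(f"{clean}\t{dirty}\t{tag}")
```

In a setup like this, the noisy copies would typically be mixed with the clean originals during fine-tuning, with `rate` chosen to roughly match the error rate observed in the real scans. A confusion table estimated from aligned OCR and ground-truth text would be more realistic than the hand-written one assumed here.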

