CL · LG · Jul 2, 2021

Data Centric Domain Adaptation for Historical Text with OCR Errors

arXiv:2107.00927v1 · 1 citation
Originality: Incremental advance
AI Analysis

The paper addresses domain shift and OCR errors in historical text processing, which is an incremental improvement in a domain-specific area.

The paper tackles Named Entity Recognition on historical Dutch and French texts with OCR errors by proposing methods for in-domain and cross-domain adaptation, achieving state-of-the-art results that outperform strong baselines.

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain, a form of data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain and in-domain results both outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
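
The idea of injecting synthetic OCR errors into clean source-domain data can be illustrated with a minimal sketch. This is not the paper's method: the confusion table `CONFUSIONS`, the function name `inject_ocr_noise`, and the `error_rate` parameter are all assumptions made for illustration, and a realistic setup would presumably derive the confusions from actual OCR output rather than a hand-written table. Noise is applied per token so that token-level NER labels stay aligned with the text.

```python
import random

# Hypothetical confusion table (illustrative only, not taken from the paper):
# each character maps to strings an OCR engine might plausibly emit instead.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "m": ["rn"],
    "s": ["f"],
    "o": ["0"],
    "i": ["l", "1"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=None):
    """Return a copy of `tokens` with simulated OCR character errors.

    Each confusable character is replaced with probability `error_rate`.
    The number of tokens never changes, so per-token labels remain valid.
    """
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            if ch in CONFUSIONS and rng.random() < error_rate:
                chars.append(rng.choice(CONFUSIONS[ch]))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


# Usage: corrupt a clean, labelled sentence; the labels are reused unchanged.
tokens = ["Willem", "woont", "in", "Amsterdam"]
labels = ["B-PER", "O", "O", "B-LOC"]
noisy_tokens = inject_ocr_noise(tokens, error_rate=0.3, seed=0)
training_pair = list(zip(noisy_tokens, labels))
```

Because the corruption is applied to each token independently, the noisy copy can be added to the source-domain training set alongside the clean original without re-annotating anything.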

Code Implementations: 1 repo