Historical German Text Normalization Using Type- and Token-Based Language Modeling
The work addresses the challenge of processing digitized historical texts for search and NLP, though its contribution is incremental: it builds on existing methods and applies them to a specific domain.
The paper tackles the problem of normalizing historical German texts from c. 1700-1900 to modern spelling. It uses a Transformer-based approach that combines type-level and token-level modeling, and reports state-of-the-art accuracy comparable to that of a much larger end-to-end system.
Historic variations of spelling pose a challenge for full-text search and natural language processing on digitized historical texts. To minimize the gap between historic orthography and contemporary spelling, an automatic orthographic normalization of the historical source material is usually pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system takes a machine learning approach based on Transformer language models: an encoder-decoder model normalizes individual word types, and a pre-trained causal language model adjusts these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable to that of a much larger, fully end-to-end sentence-based normalization system that fine-tunes a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge, because models have difficulty generalizing and extensive high-quality parallel data is lacking.
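The two-stage idea (type-level candidates, then token-level adjustment in context) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the candidate table `TYPE_CANDIDATES` stands in for the output of the encoder-decoder type model, a tiny add-one-smoothed bigram model stands in for the pre-trained causal language model, and the greedy left-to-right selection, the corpus, and the probabilities are invented. It is not the system described above, only a toy instance of rescoring type-level candidates by context.

```python
import math
from collections import Counter

# Hypothetical stand-in for the type-level model: each historical word type
# maps to modern candidates with an invented log-probability.
TYPE_CANDIDATES = {
    "wider": [("wider", math.log(0.6)), ("wieder", math.log(0.4))],
    "thun": [("tun", 0.0)],
}

# Invented toy corpus in modern spelling, used only to train the stand-in LM.
MODERN_CORPUS = [
    "er kommt wieder",
    "sie kommt wieder nach hause",
    "er ist wider den plan",
]


def train_bigram(sentences):
    """Count unigrams and bigrams; '<s>' marks the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev); a toy stand-in for a causal LM."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def normalize(sentence, unigrams, bigrams, lm_weight=1.0):
    """Greedily pick, per token, the candidate maximizing type score + LM score."""
    prev, output = "<s>", []
    for token in sentence.split():
        # Tokens without an entry are passed through unchanged.
        candidates = TYPE_CANDIDATES.get(token, [(token, 0.0)])
        best, _ = max(
            candidates,
            key=lambda c: c[1] + lm_weight * bigram_logprob(prev, c[0], unigrams, bigrams),
        )
        output.append(best)
        prev = best
    return " ".join(output)


UNIGRAMS, BIGRAMS = train_bigram(MODERN_CORPUS)
print(normalize("er kommt wider", UNIGRAMS, BIGRAMS))
print(normalize("er ist wider den plan", UNIGRAMS, BIGRAMS))
```

In this toy setup the same historical type "wider" is resolved differently depending on the preceding word, which is the kind of decision a purely type-based normalizer cannot make. A real system would replace the greedy search and the bigram model with a proper decoding strategy and a neural language model.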