Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models
This work lowers the barrier to OCR domain adaptation for practitioners and digital humanities scholars by removing the need for labeled target data. The contribution is incremental, since it builds on existing pretrained sequence models such as T5 and BART.
The paper tackles the high computational cost of domain adaptation in optical character recognition. It proposes a modular detection-and-correction framework that achieves near-state-of-the-art accuracy with single-GPU training, reducing compute by approximately 95% compared to end-to-end transformers.
Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, which limits accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-state-of-the-art accuracy with single-GPU training. Our approach decouples lightweight, domain-agnostic visual character detection from domain-specific linguistic correction performed by pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical "Pareto frontier" in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.
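To make the idea of training a corrector on synthetic noise concrete, the following minimal sketch builds (noisy, clean) text pairs from unlabeled target-domain text. It is an illustration only, not the noise model used in this work: the confusion table `CONFUSIONS`, the functions `corrupt` and `make_pairs`, the three edit operations, and the default noise rate are all assumptions introduced here.

```python
import random

# Hypothetical confusion table: visually similar characters that a
# character detector might plausibly swap. Not taken from the paper.
CONFUSIONS = {
    "l": "1I", "1": "lI", "I": "l1",
    "o": "0", "0": "oO", "O": "0",
    "e": "c", "c": "e", "u": "v", "v": "u",
    "n": "m", "m": "n", "s": "5", "5": "s",
}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt(text, rate=0.1, rng=None):
    """Return a noisy copy of `text`; each character is edited with probability `rate`."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            # Characters without a table entry are left as they are.
            out.append(rng.choice(CONFUSIONS.get(ch, ch)))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(ALPHABET))  # spurious extra character
        # "delete": emit nothing for this character
    return "".join(out)


def make_pairs(lines, rate=0.1, seed=0):
    """Build (noisy_input, clean_target) pairs for a sequence-to-sequence corrector."""
    rng = random.Random(seed)
    return [(corrupt(line, rate, rng), line) for line in lines]


if __name__ == "__main__":
    corpus = ["the quick brown fox", "ye olde booke of dayes"]
    for noisy, clean in make_pairs(corpus, rate=0.2, seed=1):
        print(repr(noisy), "->", repr(clean))
```

Pairs of this form could be used to fine-tune any sequence-to-sequence corrector, with the noisy string as input and the clean string as target. Because only plain text from the target domain is needed, no labeled images enter the loop, which is the property the annotation-free adaptation claim relies on.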