CV, DL, GN
Apr 5, 2023

Efficient OCR for Building a Diverse Digital History

Harvard
arXiv:2304.02737v2
35 citations
h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the limited access that users of digital archives have to diverse historical documents. It is an incremental improvement over existing OCR methods.

The study tackles the problem of unrepresentative digital archives by modeling OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. The resulting model is more sample-efficient and extensible than existing architectures, which enables accurate OCR in low-resource settings.

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR), which jointly learns a vision and language model, is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters' visual features, it is more sample-efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.
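The retrieval framing described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `embed` function is a hypothetical stand-in for the contrastively trained vision encoder (here it only flattens and L2-normalizes pixels), and the 3x3 glyphs, `build_index`, and `recognize` are illustrative names invented for this example. What it shows is the retrieval step alone: each character crop is embedded and assigned the label of its nearest reference embedding by cosine similarity.

```python
import numpy as np


def embed(crops: np.ndarray) -> np.ndarray:
    """Stand-in encoder (assumption): flatten each crop and L2-normalize it.

    A real system would call the contrastively trained vision encoder here.
    """
    flat = crops.reshape(len(crops), -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return flat / np.clip(norms, 1e-12, None)


def build_index(reference_crops: np.ndarray, labels) -> tuple:
    """Embed labeled reference character images once, ahead of time."""
    return embed(reference_crops), list(labels)


def recognize(query_crops: np.ndarray, index: tuple) -> str:
    """Label each query crop with its nearest reference by cosine similarity."""
    ref_emb, labels = index
    sims = embed(query_crops) @ ref_emb.T  # rows: queries, columns: references
    nearest = sims.argmax(axis=1)
    return "".join(labels[i] for i in nearest)


# Toy 3x3 "glyphs" standing in for reference character images (hypothetical data).
GLYPHS = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}
reference = np.array(list(GLYPHS.values()), dtype=float)
index = build_index(reference, GLYPHS.keys())

# Degraded queries: lower contrast plus a uniform background haze.
query = np.array([GLYPHS[c] for c in "LIT"], dtype=float)
degraded = 0.8 * query + 0.1
text = recognize(degraded, index)
print(text)  # LIT
```

In this framing, supporting a new character only requires adding its reference embedding to the index, with no retraining of a language model, which is one way to read the extensibility claim above.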

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
