CL Apr 23, 2020

A Tool for Facilitating OCR Postediting in Historical Documents

Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way

arXiv:2004.11471v1

Originality: Synthesis-oriented

AI Analysis

This is an incremental contribution: a tool that helps researchers and archivists working with historical documents improve OCR accuracy.

The paper addresses OCR errors in historical documents with a tool that uses a language model to suggest corrections for words not found in a vocabulary. The tool is tested on a chapter of a 1719 book, where it corrects a number of common errors.

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
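
The workflow described in the abstract (flag word forms missing from a vocabulary, propose alternatives, pick one by LM score) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `CONFUSIONS` table, the `candidates` and `postedit` helpers, and the toy add-one-smoothed `BigramLM` are all hypothetical names and choices introduced here. The only domain assumption is the well-known tendency of OCR engines to read the long s ("ſ") of early modern print as "f".

```python
import math
from collections import Counter
from itertools import product

# Hypothetical character confusions (illustrative only, not taken from the paper).
# Key: character as produced by OCR; value: characters it may stand for.
CONFUSIONS = {"f": ["s"], "c": ["e"]}


def candidates(word, confusions=CONFUSIONS):
    """Return every variant of `word` obtained by optionally swapping each confusable character."""
    options = [[ch] + confusions.get(ch, []) for ch in word]
    return {"".join(chars) for chars in product(*options)}


class BigramLM:
    """Toy bigram language model with add-one smoothing (a stand-in for a real LM)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, prev, word):
        """Smoothed log P(word | prev)."""
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.unigrams[prev] + self.vocab_size
        return math.log(numerator / denominator)


def postedit(tokens, vocabulary, lm):
    """Replace out-of-vocabulary tokens with the in-vocabulary candidate the LM scores highest."""
    corrected = []
    prev = "<s>"
    for token in tokens:
        word = token.lower()
        if word not in vocabulary:
            in_vocab = [c for c in candidates(word) if c in vocabulary]
            if in_vocab:
                # Choose the alternative that best fits the preceding word.
                word = max(in_vocab, key=lambda c: lm.log_prob(prev, c))
            # If no candidate is in the vocabulary, the token is left for a human to review.
        corrected.append(word)
        prev = word
    return corrected


corpus = [
    "the poor of this kingdom",
    "in this case the trade",
    "with ease the poor",
]
lm = BigramLM(corpus)
vocabulary = {w for sentence in corpus for w in sentence.split()}

print(postedit("the poor of thif kingdom".split(), vocabulary, lm))
# → ['the', 'poor', 'of', 'this', 'kingdom']
```

When several alternatives are in the vocabulary (for example "cafe" could be "case" or "ease" under the table above), the preceding word decides: after "this", the toy model prefers "case". A real setup would presumably use a much larger vocabulary and a stronger LM, and would keep the suggestion visible so a human can accept or reject it.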

