CL · Sep 13, 2021

Post-OCR Document Correction with large Ensembles of Character Sequence-to-Sequence Models

Juan Ramirez-Orta, Eduardo Xamena, Ana Maguitman, Evangelos Milios, Axel J. Soto

arXiv:2109.06264v31.221 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of improving text accuracy in OCR-processed documents, which matters for applications such as digitization and archiving. It is an incremental advance, with gains demonstrated on the ICDAR 2019 post-OCR text correction benchmark.

The paper tackles the correction of errors left in documents after Optical Character Recognition (OCR). Its best-performing method splits the input into character n-grams, corrects each one with a character sequence-to-sequence model, and merges the individual corrections through a voting scheme that is equivalent to a large ensemble of sequence models. On the ICDAR 2019 competition data, it achieves state-of-the-art performance in five of the nine languages tested.

In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document in character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each one of the members of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
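
The n-gram voting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). The names `correct_with_voting`, `toy_corrector`, and the `weight` parameter are hypothetical, and the sketch assumes a corrector that returns a string of the same length as its input, so that every corrected character maps back to a single document position. A real sequence-to-sequence model does not guarantee this, and the toy corrector below only stands in for one.

```python
from collections import defaultdict
from typing import Callable, Optional


def correct_with_voting(
    text: str,
    correct_window: Callable[[str], str],
    n: int = 5,
    weight: Optional[Callable[[int, int], float]] = None,
) -> str:
    """Correct `text` by voting over overlapping character n-gram windows.

    Assumption: `correct_window` preserves length, so the character at
    offset i of the window starting at s votes for document position s + i.
    """
    if len(text) <= n:
        # Short enough to be handled as a single window.
        return correct_window(text)
    if weight is None:
        # Uniform weighting: every window has the same say.
        weight = lambda offset, size: 1.0

    # One weighted tally of candidate characters per document position.
    votes = [defaultdict(float) for _ in text]
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        corrected = correct_window(window)
        if len(corrected) != len(window):
            raise ValueError("this sketch assumes a length-preserving corrector")
        for offset, char in enumerate(corrected):
            votes[start + offset][char] += weight(offset, n)

    # Each position takes the candidate with the largest total weight.
    return "".join(max(tally, key=tally.get) for tally in votes)


def toy_corrector(window: str) -> str:
    """Hypothetical stand-in for a trained model: rewrite '1' as 'l'
    only when it has a letter on both sides inside the window."""
    chars = list(window)
    for i in range(1, len(window) - 1):
        if window[i] == "1" and window[i - 1].isalpha() and window[i + 1].isalpha():
            chars[i] = "l"
    return "".join(chars)


def triangle(offset: int, size: int) -> float:
    """Example non-uniform weighting: trust the middle of a window more."""
    return float(min(offset + 1, size - offset))


print(correct_with_voting("he1lo world", toy_corrector, n=5))
print(correct_with_voting("he1lo world", toy_corrector, n=5, weight=triangle))
```

In this example the `1` is covered by three windows. Two of them see it with letters on both sides and vote for `l`; the third sees it at the window edge and leaves it unchanged, so the majority wins. Passing a different `weight` function is one simple way to explore how much each window's vote should count.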
