CL, Jul 3, 2023

Estimating Post-OCR Denoising Complexity on Numerical Texts

Arthur Hemmer, Jérôme Brachat, Mickaël Coustaty, Jean-Marc Ogier

arXiv:2307.01020v1, 0.94 citations, h-index: 30

Originality: Synthesis-oriented

AI Analysis

This paper addresses a domain-specific gap in OCR technology for practical applications involving numerical documents. The contribution is incremental, since it focuses on evaluation rather than on a new denoising method.

The paper tackles the problem of evaluating OCR post-processing difficulty for numerical texts such as invoices and payslips, showing that such texts are significantly harder to denoise than texts made of natural, alphabetical words.

Post-OCR processing has significantly improved over the past few years. However, these improvements have primarily benefited texts consisting of natural, alphabetical words, as opposed to documents of numerical nature such as invoices, payslips, and medical certificates. To evaluate the OCR post-processing difficulty of these datasets, we propose a method to estimate the denoising complexity of a text, evaluate it on several datasets of varying nature, and show that texts of numerical nature have a significant disadvantage. We evaluate the estimated complexity ranking with respect to the error rates of modern-day denoising approaches to show the validity of our estimator.
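
The last step of the abstract, comparing an estimated complexity ranking against the error rates of denoising models, can be illustrated with a rank correlation. The sketch below is only an illustration: the abstract does not say which statistic the authors use, so Spearman's rank correlation is an assumption, and the dataset names and all numbers are made-up placeholders, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical placeholder values, NOT taken from the paper.
estimated_complexity = {"dataset_a": 0.12, "dataset_b": 0.35, "dataset_c": 0.58}
observed_error_rate = {"dataset_a": 0.04, "dataset_b": 0.09, "dataset_c": 0.21}

names = sorted(estimated_complexity)
rho = spearman(
    [estimated_complexity[n] for n in names],
    [observed_error_rate[n] for n in names],
)
print(f"Spearman rank correlation: {rho:.3f}")
```

A correlation close to 1 would mean that the estimator orders the datasets the same way the observed error rates do, which is the kind of agreement the abstract appeals to when it speaks of the validity of the estimator.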

