CL CV

Dec 14, 2024

RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

Harshvivek Kashid, Pushpak Bhattacharyya

arXiv:2412.15248v2

Originality: Incremental advance

AI Analysis

The paper addresses the problem of improving OCR accuracy for low-resource languages. The advance is incremental because it applies existing machine translation methods to a new domain.

The paper tackles the scarcity of post-OCR error correction datasets for low-resource Devanagari languages. It proposes RoundTripOCR, a synthetic data generation technique, and releases datasets for six languages. It also presents a machine translation-based approach that treats OCR errors as mistranslations and corrects them with pre-trained transformer models.

Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose RoundTripOCR, an approach to synthetic data generation for Devanagari languages that tackles the scarcity of post-OCR error correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction that leverages techniques from machine translation. Our method translates erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus. It employs pre-trained transformer models to learn the mapping from erroneous to correct text, effectively correcting OCR errors.
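The abstract does not spell out the data generation pipeline, so the following is only a minimal sketch of the general idea of building (erroneous, correct) pairs for a translation-style corrector. It assumes that a round trip means passing clean text through a lossy OCR step and pairing the output with the original. The OCR step is replaced here by a hypothetical `simulate_ocr` function with a made-up confusion table; `build_pairs`, `edit_distance` and `cer` are likewise illustrative names, not taken from the paper or its code.

```python
import random

# Hypothetical confusion table of visually similar Devanagari characters.
# Illustrative only; a real pipeline would get its errors from an actual OCR engine.
CONFUSIONS = {"ब": "व", "व": "ब", "घ": "ध", "ध": "घ", "म": "भ", "भ": "म"}


def simulate_ocr(text, error_rate, rng):
    """Stand-in for the OCR step: swap confusable characters at random.

    Works on Unicode code points, not grapheme clusters, which is a simplification.
    """
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def build_pairs(sentences, error_rate=0.3, seed=0):
    """Return (noisy_source, clean_target) pairs, like a parallel MT corpus."""
    rng = random.Random(seed)
    return [(simulate_ocr(s, error_rate, rng), s) for s in sentences]


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of a hypothesis against its reference."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)


if __name__ == "__main__":
    for noisy, clean in build_pairs(["भारत", "धन्यवाद", "नमस्ते"]):
        print(noisy, clean, cer(noisy, clean))
```

In a setup like the one the abstract describes, the noisy side of each pair would serve as the source and the clean side as the target when fine-tuning a pre-trained sequence-to-sequence transformer. The character error rate before and after correction is one common way to measure the gain.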
