CV · CL · Nov 27, 2023

Data Generation for Post-OCR correction of Cyrillic handwriting

Evgenii Davydkin, Aleksandr Markelov, Egor Iuldashev, Anton Dudkin, Ivan Krivorotov

arXiv:2311.15896v1 · 4 citations · h-index: 2 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses a domain-specific problem for researchers and practitioners in OCR and handwriting analysis, with a focus on Cyrillic scripts. The contribution is incremental: it applies existing methods (Bézier curves and T5) to a new data generation task.

The paper tackles the lack of large datasets for post-OCR correction of handwritten Cyrillic text by developing a synthetic handwriting generation engine using Bézier curves to create realistic text, which is then used to train a T5-based correction model, achieving improved Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) on real datasets like HWR200 and School_notebooks_RU.
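
To make the Bézier-based generation idea concrete, here is a minimal sketch of how one pen stroke could be sampled from a cubic Bézier curve. It is not the authors' engine: the function names `cubic_bezier` and `sample_stroke`, the control points, and the `jitter` parameter (a stand-in for handwriting variation) are all assumptions made for illustration.

```python
import random


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Bernstein basis polynomials of degree 3.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_stroke(control_points, n=20, jitter=0.0, rng=None):
    """Sample n points along one stroke (hypothetical helper).

    jitter randomly displaces the four control points, which is one simple
    way to make each rendering of the same glyph look slightly different.
    """
    rng = rng or random.Random(0)
    pts = [
        (x + rng.uniform(-jitter, jitter), y + rng.uniform(-jitter, jitter))
        for x, y in control_points
    ]
    return [cubic_bezier(*pts, i / (n - 1)) for i in range(n)]


# Illustrative control points for a single arch-shaped stroke.
stroke = sample_stroke([(0, 0), (1, 2), (3, 2), (4, 0)], n=5)
```

With `jitter=0.0` the stroke starts and ends exactly on the first and last control points; a positive `jitter` would give a different variant of the same stroke on each call with a different random generator.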

This paper introduces a novel approach to Post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap is due to the lack of large text corpora that provide OCR errors for training language-based POC models, which require large corpora. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on the HWR200 and School_notebooks_RU datasets, as they provide significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers evaluating student performance. This can be done simply by comparing sentences before and after correction and displaying the differences in the text. Our primary contribution lies in the innovative use of Bézier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, using real open corpora of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. Paper contributions are available at https://github.com/dbrainio/CyrillicHandwritingPOC
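
The abstract does not spell out how WAR and CAR are computed. The sketch below assumes the common convention of one minus the normalized Levenshtein distance, over words and over characters respectively. It also shows one possible way to list the differences between text before and after correction, using Python's standard `difflib`. The helper names (`car`, `war`, `changed_words`) and the sample sentences are hypothetical and are not taken from the paper or its repository.

```python
import difflib


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def accuracy_rate(ref_units, hyp_units):
    """Assumed definition: 1 - distance / reference length, floored at 0."""
    if not ref_units:
        return 1.0 if not hyp_units else 0.0
    return max(0.0, 1.0 - edit_distance(ref_units, hyp_units) / len(ref_units))


def car(reference, hypothesis):
    """Character Accuracy Rate under the assumption above."""
    return accuracy_rate(list(reference), list(hypothesis))


def war(reference, hypothesis):
    """Word Accuracy Rate under the assumption above (whitespace tokens)."""
    return accuracy_rate(reference.split(), hypothesis.split())


def changed_words(before, after):
    """Return (before, after) word spans that differ between two sentences."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(" ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


# Illustrative strings: a recognizer output with one wrong letter.
truth = "мама мыла раму"
recognized = "мама мыла рану"
scores = (war(truth, recognized), car(truth, recognized))
diff = changed_words(recognized, truth)
```

Comparing `scores` computed on the raw HTR output with `scores` computed on the corrected output gives the "with and without post-OCR correction" comparison the abstract describes, while `diff` holds the word pairs that could be highlighted for a teacher.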
