CL | Feb 27, 2022

OCR Improves Machine Translation for Low-Resource Languages

Oana Ignat, Jean Maillard, Vishrav Chaudhary, Francisco Guzmán

arXiv:2202.13274v232.0639 citations | h-index: 25

Originality: Incremental advance

AI Analysis

This work improves machine translation for low-resource languages by using monolingual data extracted with OCR. The advance is incremental, since it builds on existing backtranslation methods.

The researchers examined OCR performance on low-resource languages and scripts by creating a new benchmark, OCR4MT, covering 60 such languages. They show that OCR-generated monolingual data can improve Machine Translation models through backtranslation, and an ablation study determines the minimum OCR quality needed for that data to be useful.

We aim to investigate the performance of current OCR systems on low-resource languages and low-resource scripts. We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise, for 60 low-resource languages in low-resource scripts. We evaluate state-of-the-art OCR systems on our benchmark and analyse the most common errors. We show that OCR monolingual data is a valuable resource that can increase the performance of Machine Translation models when used in backtranslation. We then perform an ablation study to investigate how OCR errors impact Machine Translation performance and to determine the minimum level of OCR quality needed for the monolingual data to be useful for Machine Translation.
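
To make "level of OCR quality" concrete, here is a minimal sketch of one common way to quantify it: character error rate (CER), the edit distance between OCR output and a reference transcription divided by the reference length. The abstract does not say which metric the authors use, so CER is an assumption here, and the names `edit_distance`, `cer`, `meets_quality`, and the `max_cer` threshold are hypothetical and chosen only for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def meets_quality(ref: str, hyp: str, max_cer: float) -> bool:
    """Hypothetical check: is the OCR output within an assumed CER budget?"""
    return cer(ref, hyp) <= max_cer


if __name__ == "__main__":
    reference = "abcd"
    ocr_output = "abxd"  # one substituted character
    print(cer(reference, ocr_output))                 # 0.25
    print(meets_quality(reference, ocr_output, 0.1))  # False
```

A check like this needs reference transcriptions, so it applies to benchmark data rather than to raw OCR output with no ground truth. The threshold value is a placeholder, not a figure from the paper.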
