CL | Feb 26, 2025

Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

Aloka Fernando, Nisansa de Silva, Menan Velyuthan, Charitha Rathnayake, Surangika Ranathunga

arXiv:2502.19074v3 | 8.33 citations | h-index: 14 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses data quality issues for low-resource language translation, but it is incremental as it builds on existing parallel data curation techniques.

The paper tackled the problem of noise in web-mined parallel corpora for low-resource languages by showing that biases in multilingual pre-trained language models cause disparities in Neural Machine Translation quality, and it improved results by applying debiasing heuristics to filter noisy sentences.

Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs) is the most common PDC technique. However, previous research has shown that the choice of the multiPLM significantly impacts the quality of the filtered parallel corpus, and the Neural Machine Translation (NMT) models trained using such data show a disparity across multiPLMs. This paper shows that this disparity is due to different multiPLMs being biased towards certain types of sentence pairs, which are treated as noise from an NMT point of view. We show that such noisy parallel sentences can be removed to a certain extent by employing a series of heuristics. NMT models trained on the curated corpus produce better results while minimizing the disparity across multiPLMs. We publicly release the source code and the curated datasets.
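
The curation pipeline described in the abstract (apply heuristics to drop noisy pairs, then rank what remains by embedding similarity) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code. The two rules in `passes_heuristics` (drop untranslated copies, drop very short pairs) are hypothetical placeholders, since the abstract does not list the actual heuristics. The embedding vectors are hand-written toy values standing in for multiPLM output, and the names `cosine`, `passes_heuristics`, `curate`, `min_tokens`, and `top_k` are invented for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_heuristics(src, tgt, min_tokens=3):
    """Hypothetical filtering rules, for illustration only."""
    # An untranslated copy gets a near-perfect similarity score
    # but is useless as NMT training data.
    if src.strip() == tgt.strip():
        return False
    # Very short fragments (menu items, button labels) carry little signal.
    if len(src.split()) < min_tokens or len(tgt.split()) < min_tokens:
        return False
    return True


def curate(pairs, top_k):
    """pairs: list of (src, tgt, src_vec, tgt_vec) tuples.

    Filters with the heuristics first, then keeps the top_k pairs
    ranked by embedding similarity.
    """
    kept = [p for p in pairs if passes_heuristics(p[0], p[1])]
    ranked = sorted(kept, key=lambda p: cosine(p[2], p[3]), reverse=True)
    return [(src, tgt) for src, tgt, _, _ in ranked[:top_k]]


# Toy data: target sentences are placeholder tokens, vectors are made up.
toy_pairs = [
    ("home page home page", "home page home page", [1.0, 0.0], [1.0, 0.0]),
    ("click here", "t1 t2", [0.7, 0.7], [0.7, 0.7]),
    ("the weather is nice today", "t1 t2 t3 t4 t5", [0.9, 0.1], [0.8, 0.2]),
    ("please send the report tomorrow", "t6 t7 t8 t9", [0.2, 0.9], [0.9, 0.3]),
]

best = curate(toy_pairs, top_k=1)
```

Without the heuristic step, the first two toy pairs would top the ranking with a similarity of 1.0 even though neither is a useful translation pair. That is the kind of embedding-score bias the abstract describes.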
