CL AI, Jul 2, 2024

How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation

arXiv:2407.02208v29.613 citations, h-index: 17

Originality: Incremental advance

AI Analysis

This paper addresses noise handling in machine translation for practitioners who train on web-mined data, offering an incremental improvement over existing methods.

The paper tackles the problem of semantic misalignment noise in web-mined parallel data for machine translation, proposing a self-correction method that improves translation performance on both simulated and real-world noisy datasets.

The massive amounts of web-mined parallel data contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first introduce a process for simulating misalignment controlled by semantic similarity, which closely resembles misaligned sentences in real-world web-crawled corpora. Under our simulated misalignment noise settings, we quantitatively analyze its impact on machine translation and demonstrate the limited effectiveness of widely used pre-filters for noise detection. This underscores the necessity of more fine-grained ways to handle hard-to-detect misalignment noise. With an observation of the increasing reliability of the model's self-knowledge for distinguishing misaligned and clean data at the token level, we propose self-correction, an approach that gradually increases trust in the model's self-knowledge to correct the training supervision. Comprehensive experiments show that our method significantly improves translation performance both in the presence of simulated misalignment noise and when applied to real-world, noisy web-mined datasets, across a range of translation tasks.
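
The abstract describes self-correction only at a high level, so the following is a minimal sketch of one way the idea could look rather than the authors' implementation. It assumes that "correcting the training supervision" means mixing the one-hot reference token with the model's own predicted distribution, and that trust grows linearly during training. The names trust_schedule, corrected_target and max_trust are hypothetical, as is the linear ramp.

```python
import math


def trust_schedule(step, total_steps, max_trust=0.5):
    """Hypothetical linear ramp: trust in the model's own prediction
    grows from 0 at step 0 to max_trust at the end of training."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max_trust * frac


def corrected_target(reference_id, model_probs, trust):
    """Assumed correction rule: interpolate between the one-hot reference
    label and the model's predicted distribution for one target token."""
    mixed = [trust * p for p in model_probs]
    mixed[reference_id] += 1.0 - trust
    return mixed


def cross_entropy(target, model_probs, eps=1e-12):
    """Token-level cross-entropy between a (soft) target and the prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, model_probs) if t > 0.0)


# Toy example: a 2-word vocabulary where the (possibly misaligned) reference
# says token 0, but the model puts most of its probability on token 1.
model_probs = [0.1, 0.9]
plain_loss = cross_entropy([1.0, 0.0], model_probs)
trust = trust_schedule(step=100, total_steps=100)
soft_target = corrected_target(0, model_probs, trust)
corrected_loss = cross_entropy(soft_target, model_probs)
```

In this toy case the corrected target penalizes the model less for disagreeing with a reference token that may be misaligned. Early in training the trust value is near zero, so the supervision stays close to the original reference.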
