CL LG Nov 4, 2025

Automatic Machine Translation Detection Using a Surrogate Multilingual Translation Model

Cristian García-Romero, Miquel Esplà-Gomis, Felipe Sánchez-Martínez

arXiv:2511.02958v12.7

Originality: Highly original

AI Analysis

This work addresses a critical data-quality issue for machine translation systems, which rely on large parallel corpora: it provides a filtering method intended to improve translation performance.

The paper tackles the detection of machine-generated translations in training data, which can otherwise degrade machine translation quality. It reports accuracy gains of at least 5 percentage points over state-of-the-art methods, particularly for non-English language pairs.

Modern machine translation (MT) systems depend on large parallel corpora, often collected from the Internet. However, recent evidence indicates that (i) a substantial portion of these texts are machine-generated translations, and (ii) an overreliance on such synthetic content in training data can significantly degrade translation quality. As a result, filtering out non-human translations is becoming an essential pre-processing step in building high-quality MT systems. In this work, we propose a novel approach that directly exploits the internal representations of a surrogate multilingual MT model to distinguish between human and machine-translated sentences. Experimental results show that our method outperforms current state-of-the-art techniques, particularly for non-English language pairs, achieving gains of at least 5 percentage points of accuracy.
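As a rough illustration of what classifying on a model's internal representations can look like, here is a minimal sketch. It is not the authors' method: the abstract does not specify the surrogate model, which layers are used, or the classifier. The sketch assumes per-token hidden vectors are mean-pooled into one sentence vector and fed to a logistic-regression probe. The helper `fake_hidden_states` is hypothetical and produces synthetic vectors in place of real model outputs, so the accuracy it prints says nothing about the paper's results.

```python
import math
import random


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dim)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Return 1 (machine-translated) or 0 (human-translated)."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def fake_hidden_states(label, rng, dim=8):
    """Hypothetical stand-in for surrogate-model hidden states.

    Assumption for illustration only: the two classes differ by a small
    shift in every dimension. Real representations would come from
    running a multilingual MT model on each sentence pair.
    """
    n_tokens = rng.randint(5, 15)
    shift = 0.5 if label == 1 else -0.5
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n_tokens)]


rng = random.Random(0)
labels = [i % 2 for i in range(200)]  # 1 = machine, 0 = human (synthetic)
features = [mean_pool(fake_hidden_states(y, rng)) for y in labels]

# Train on the first 150 examples, evaluate on the remaining 50.
w, b = train_probe(features[:150], labels[:150])
held_out = list(zip(features[150:], labels[150:]))
accuracy = sum(predict(w, b, x) == y for x, y in held_out) / len(held_out)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a real pipeline, `fake_hidden_states` would be replaced by a forward pass of the chosen surrogate model over each source and target sentence pair. The pooling strategy and the classifier are design choices that the abstract leaves open.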
