CL · Jan 28

When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation

arXiv:2601.20858v1 · 1 citation · h-index: 41
Originality: Incremental advance
AI Analysis

This work addresses inflated evaluation scores caused by benchmark contamination, a problem that affects how researchers and practitioners in machine translation assess models.

The study investigates cross-direction contamination in machine translation evaluation. It shows that a model trained on a benchmark such as FLORES-200 can score artificially high in translation directions it never saw, because it has memorized the target side. Replacing named entities in the source lowers BLEU consistently, which makes it an effective probe for this memorization.

Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz's FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.
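The named-entity probe described above can be illustrated with a minimal sketch. The abstract does not spell out the exact protocol, so several things here are assumptions: `translate` is a hypothetical callable standing in for the model under test, the reference is perturbed with the same entity swaps as the source, and `simple_bleu` is a simplified, smoothed, whitespace-tokenized stand-in for BLEU rather than the metric implementation used in the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Smoothed sentence-level BLEU on whitespace tokens (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps the log defined when no n-gram matches.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


def apply_swaps(text, swaps):
    """Replace each (old, new) entity pair by plain string substitution."""
    for old, new in swaps:
        text = text.replace(old, new)
    return text


def entity_swap_delta(translate, source, reference, src_swaps, tgt_swaps):
    """Return (score before, score after, drop) for one sentence pair.

    Assumption: the reference is perturbed with the matching target-side
    swaps, so a model that really translates the new source loses nothing.
    """
    before = simple_bleu(translate(source), reference)
    perturbed_source = apply_swaps(source, src_swaps)
    perturbed_reference = apply_swaps(reference, tgt_swaps)
    after = simple_bleu(translate(perturbed_source), perturbed_reference)
    return before, after, before - after


# Toy data (invented for illustration, not taken from FLORES-200).
SRC = "Maria visited Paris on Monday ."
REF = "Maria besuchte Paris am Montag ."
SWAPS = [("Maria", "Elena"), ("Paris", "Rome")]


def memorizer(_source):
    """Hypothetical contaminated model: always recalls the memorized reference."""
    return REF


def faithful(source):
    """Hypothetical clean model: carries the entities over from the source."""
    return "Elena besuchte Rome am Montag ." if "Elena" in source else REF


if __name__ == "__main__":
    print("memorizer:", entity_swap_delta(memorizer, SRC, REF, SWAPS, SWAPS))
    print("faithful: ", entity_swap_delta(faithful, SRC, REF, SWAPS, SWAPS))
```

In this toy setup the memorizing model keeps emitting the original entities, so its score falls after the swap, while the faithful model's score is unchanged. On real data the drop would be averaged over many sentences and computed with a standard BLEU implementation.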

