CL · LG · Jan 12, 2024

Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation

arXiv:2401.06688v2 · 31 citations · h-index: 7 · ACL
Originality: Incremental advance
AI Analysis

This work addresses translation quality for users of machine translation systems, offering an incremental improvement over existing methods like beam search and reranking.

The paper addresses the mismatch between the probabilities assigned by neural machine translation models and human preferences. It introduces QE-fusion, a method that combines spans from multiple translation hypotheses using a quality estimation metric, and reports consistent improvements in COMET and BLEURT scores across multiple models and language pairs.

Neural machine translation systems estimate probabilities of target sentences given source sentences, yet these estimates may not align with human preferences. This work introduces QE-fusion, a method that synthesizes translations using a quality estimation metric (QE), which correlates better with human judgments. QE-fusion leverages a pool of candidates sampled from a model, combining spans from different candidates using a QE metric such as CometKiwi. We compare QE-fusion against beam search and recent reranking techniques, such as Minimum Bayes Risk decoding or QE-reranking. Our method consistently improves translation quality in terms of COMET and BLEURT scores when applied to large language models (LLMs) used for translation (PolyLM, XGLM, Llama2, Mistral, ALMA, and Tower) and to multilingual translation models (NLLB), over five language pairs. Notably, QE-fusion exhibits larger improvements for LLMs due to their ability to generate diverse outputs. We demonstrate that our approach generates novel translations in over half of the cases and consistently outperforms other methods across varying numbers of candidates (5-200). Furthermore, we empirically establish that QE-fusion scales linearly with the number of candidates in the pool.
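The span-combination idea in the abstract can be illustrated with a small sketch. This is a simplified greedy variant written for illustration, not the authors' implementation: it starts from the candidate the QE metric rates highest, finds spans where other candidates diverge from it, and keeps a substitution only when the QE score improves. The names `qe_fusion_sketch` and `toy_qe` are hypothetical, and `toy_qe` is a stand-in scorer so the example runs without a model; in practice the `qe_score` callable would wrap a reference-free metric such as CometKiwi.

```python
from difflib import SequenceMatcher
from typing import Callable, List

# Hypothetical signature: (source, hypothesis) -> quality score, higher is better.
QEScorer = Callable[[str, str], float]


def qe_fusion_sketch(source: str, candidates: List[str], qe_score: QEScorer) -> str:
    """Greedy illustration of fusing candidate spans under a QE metric."""
    # Start from the candidate that the QE metric rates highest.
    base = max(candidates, key=lambda c: qe_score(source, c)).split()
    best = qe_score(source, " ".join(base))

    for cand in candidates:
        alt = cand.split()
        ops = SequenceMatcher(None, base, alt, autojunk=False).get_opcodes()
        # Walk the divergent spans right to left so that splicing a span
        # does not shift the indices of spans that are still to be visited.
        for tag, i1, i2, j1, j2 in reversed(ops):
            if tag == "equal":
                continue
            trial = base[:i1] + alt[j1:j2] + base[i2:]
            score = qe_score(source, " ".join(trial))
            if score > best:  # keep the substitution only if QE improves
                base, best = trial, score
    return " ".join(base)


def toy_qe(source: str, hypothesis: str) -> float:
    """Stand-in scorer (assumption): counts tokens from a preferred set."""
    preferred = {"the", "sits", "mat"}
    return float(sum(tok in preferred for tok in hypothesis.split()))


candidates = ["the cat sat on the rug", "a cat sits on the mat"]
fused = qe_fusion_sketch("(source sentence)", candidates, toy_qe)
print(fused)
```

With the toy scorer, the fused output mixes spans from both candidates and matches neither of them, which is the kind of novel translation the abstract refers to. The sketch keeps a single hypothesis at each step and makes no claim about the search strategy or cost of the published method.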
