CL AI Jul 23, 2024

Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models

Kenza Benkirane, Laura Gongas, Shahar Pelles, Naomi Fuchs, Joshua Darmon, Pontus Stenetorp, David Ifeoluwa Adelani, Eduardo Sánchez

arXiv:2407.16470v3 | 16.229 citations | h-index: 34 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses a critical challenge: improving user trust in machine translation systems through better hallucination detection. The advance is incremental, since it applies existing LLMs to a known problem and shows limited gains for low-resource languages.

This paper tackles hallucination detection in machine translation, particularly for low-resource languages, by evaluating large language models and semantic similarity across 16 language directions. For high-resource languages, Llama3-70B outperforms the previous state of the art by up to 0.16 MCC. For low-resource languages, Claude Sonnet outperforms the other LLMs by 0.03 MCC on average.
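For readers unfamiliar with the metric, MCC (Matthews Correlation Coefficient) scores binary predictions on a scale from -1 to 1, where 0 corresponds to chance-level agreement. The sketch below computes it from confusion-matrix counts using the standard formula. The label encoding (1 = hallucination, 0 = no hallucination) and the toy labels are assumptions for illustration, not data or conventions taken from the paper.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC. Assumed encoding: 1 = hallucination, 0 = no hallucination."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when a row or column of the
        # confusion matrix is empty and the ratio is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy gold labels and detector outputs (illustrative only).
gold = [1, 0, 0, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 0, 1, 1, 0]
print(round(matthews_corrcoef(gold, pred), 3))
```

Because MCC uses all four confusion-matrix cells, it is less easily inflated by class imbalance than plain accuracy.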

Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to or even better than previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.
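One way to picture detection by semantic similarity within multilingual embeddings is a threshold on the cosine similarity between the source and translation embeddings. The sketch below assumes that decision rule; the abstract does not spell out the exact procedure. The function names, the 0.5 threshold, and the three-dimensional toy vectors are all hypothetical placeholders; a real setup would obtain the vectors from a multilingual sentence encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_hallucination(src_vec, mt_vec, threshold=0.5):
    """Flag a translation whose embedding is far from the source embedding.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out labeled data.
    """
    return cosine_similarity(src_vec, mt_vec) < threshold


# Toy stand-ins for sentence embeddings (illustrative only).
source_embedding = [0.9, 0.1, 0.3]
faithful_translation = [0.8, 0.2, 0.35]
detached_translation = [-0.1, 0.9, -0.4]

print(flag_hallucination(source_embedding, faithful_translation))
print(flag_hallucination(source_embedding, detached_translation))
```

The resulting binary flags can then be compared against gold labels with a metric such as MCC.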
