Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
This work addresses unreliable translations produced by multilingual LLMs, a concern for both users and developers. It provides a forward-looking testbed, though the contribution is incremental because it builds on existing MT benchmarking efforts.
The authors address hallucinations in multilingual large language models (LLMs) used for machine translation. They introduce a diagnostic framework with a taxonomy, along with HalloMTBench, a human-verified benchmark covering 11 language directions. The result is a curated set of 5,435 instances and the identification of distinct hallucination triggers, such as model scale and linguistic biases.
Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To expose hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We use 4 frontier LLMs to generate candidates, then scrutinize these candidates with an ensemble of LLM judges and expert validation. In this way, we curate 5,435 high-quality instances. We evaluate 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers": unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and language mixing amplified by Reinforcement Learning (RL). HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available at https://huggingface.co/collections/AIDC-AI/marco-mt.
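
The curation step described above (candidate translations scrutinized by an ensemble of LLM judges before expert validation) could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the strict-majority voting rule, the extra "none" label, and the names `Judge`, `ensemble_label`, and `constant_judge` are all assumptions introduced here, and the stub judges stand in for real LLM calls.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# The two taxonomy categories named in the abstract, plus an assumed
# "none" label for translations that a judge considers faithful.
LABELS = ("none", "instruction_detachment", "source_detachment")

# Hypothetical judge interface: (source, translation) -> one label.
# A real judge would wrap an LLM call; that part is not shown here.
Judge = Callable[[str, str], str]


def ensemble_label(source: str, translation: str,
                   judges: Sequence[Judge]) -> Optional[str]:
    """Return the hallucination label that a strict majority of judges
    agree on, or None if there is no such agreement (assumed rule)."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = Counter(judge(source, translation) for judge in judges)
    label, count = votes.most_common(1)[0]
    # Keep only non-"none" labels backed by more than half of the judges.
    if label != "none" and count * 2 > len(judges):
        return label
    return None


def constant_judge(label: str) -> Judge:
    """Stub judge that always returns the same label (illustration only)."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return lambda source, translation: label


if __name__ == "__main__":
    judges = [constant_judge("source_detachment"),
              constant_judge("source_detachment"),
              constant_judge("none")]
    # Two of three judges agree, so the instance would be forwarded
    # to expert validation under this assumed rule.
    print(ensemble_label("Hello.", "Bonjour le monde entier.", judges))
```

Under this assumed rule, an instance would reach expert validation only when most judges agree on the same category, which trades recall for fewer spurious candidates.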