CL · May 28, 2025

Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate

arXiv:2505.21999v1 · 2 citations · h-index: 38 · IJCNLP-AACL
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, consistent cross-lingual evaluation of LLMs, which matters to AI and NLP practitioners. The advance is incremental, since it builds on existing translation-based methods.

The paper tackles the evaluation of multilingual consistency in large language models (LLMs) by proposing a Translate then Evaluate framework. It reveals pronounced inconsistencies across thirty languages, with severe deficits in certain language families and scripts.

Large language models (LLMs) provide detailed and impressive responses to queries in English. However, are they really consistent at responding to the same query in other languages? The popular way of evaluating the multilingual performance of LLMs requires expensive-to-collect annotated datasets. Further, evaluating tasks like open-ended generation, where multiple correct answers may exist, is nontrivial. Instead, we propose to evaluate the predictability of model responses across different languages. In this work, we propose a framework to evaluate LLMs' cross-lingual consistency based on a simple Translate then Evaluate strategy. We instantiate this evaluation framework along two dimensions of consistency: information and empathy. Our results reveal pronounced inconsistencies in popular LLM responses across thirty languages, with severe performance deficits in certain language families and scripts, underscoring critical weaknesses in their multilingual capabilities. These findings necessitate cross-lingual evaluations that are consistent along multiple dimensions. We invite practitioners to use our framework for future multilingual LLM benchmarking.
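
The Translate then Evaluate idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `translate` and `respond` callables are hypothetical placeholders for a machine-translation system and an LLM, the English pivot is an assumption, and the token-overlap `jaccard` score is only a crude stand-in for whatever information or empathy measure a real evaluation would use.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a crude stand-in for a real metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def translate_then_evaluate(
    query: str,
    languages: List[str],
    translate: Callable[[str, str, str], str],  # hypothetical: (text, src, tgt) -> text
    respond: Callable[[str], str],              # hypothetical: prompt -> model response
    similarity: Callable[[str, str], float] = jaccard,
    pivot: str = "en",                          # assumption: English as the pivot language
) -> Dict[str, float]:
    """Score each language by how similar its back-translated response
    is to the response given to the pivot-language query."""
    reference = respond(query)
    scores: Dict[str, float] = {}
    for lang in languages:
        prompt = translate(query, pivot, lang)   # ask the same question in `lang`
        answer = respond(prompt)                 # model answers in `lang`
        back = translate(answer, lang, pivot)    # translate the answer to the pivot
        scores[lang] = similarity(reference, back)
    return scores


# Toy stand-ins so the sketch runs without any external service.
def toy_translate(text: str, src: str, tgt: str) -> str:
    # "Translation" here only adds or strips a language tag.
    if tgt == "en":
        return text.split("] ", 1)[1]
    return f"[{tgt}] {text}"


def toy_respond(prompt: str) -> str:
    # The toy model gives a shorter answer in the made-up language "xx".
    if prompt.startswith("[xx]"):
        return "[xx] drink water"
    if prompt.startswith("[yy]"):
        return "[yy] drink water and rest well"
    return "drink water and rest well"


scores = translate_then_evaluate(
    "how to recover from a cold", ["xx", "yy"], toy_translate, toy_respond
)
print(scores)  # {'xx': 0.4, 'yy': 1.0}
```

In this toy run the made-up language "xx" scores lower because its back-translated answer omits part of the reference answer, which is the kind of cross-lingual gap the framework is meant to surface. Passing the metric as a parameter keeps the loop unchanged when a different consistency dimension is evaluated.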

