CL · May 8

Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation

arXiv:2605.07533

AI Analysis

For researchers and practitioners using LLMs for machine translation, this work identifies token-level dynamics as a key factor in translation failures, particularly for low-resource languages.

The paper analyzes failure modes of LLMs in machine translation by evaluating 15 models, including four reasoning LLMs, across 22 language pairs, and finds that non-English-centric pairs consistently yield lower COMET scores than English-centric pairs. It introduces Token Activation Rate (TAR), a metric of how effectively a model uses language-specific tokens in its vocabulary during generation, and shows that lower TAR is strongly associated with poorer translation performance.

Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.
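The abstract does not give a formula for TAR, so the following is only a minimal sketch of one plausible reading: the share of a language's vocabulary tokens that actually appear in a model's generated output. The function name `token_activation_rate`, the `language_token_ids` set, and the toy token IDs are all hypothetical and are not taken from the paper; the authors' definition may differ.

```python
from typing import Iterable, Set


def token_activation_rate(
    generated: Iterable[Iterable[int]],
    language_token_ids: Set[int],
) -> float:
    """Share of a language's vocabulary tokens that appear in generated output.

    ASSUMPTION: this is an illustrative reading of TAR, not the paper's
    formula. `language_token_ids` is a hypothetical set of vocabulary IDs
    judged to belong to the target language.
    """
    if not language_token_ids:
        raise ValueError("language_token_ids must not be empty")

    # Collect every language-specific token that was generated at least once.
    activated = set()
    for sequence in generated:
        activated.update(t for t in sequence if t in language_token_ids)

    return len(activated) / len(language_token_ids)


# Toy data (made up): four language-specific tokens, two generated sequences.
language_vocab = {10, 11, 12, 13}
outputs = [[10, 11, 99], [11, 10, 98]]

tar = token_activation_rate(outputs, language_vocab)
print(tar)  # 0.5: tokens 10 and 11 were used, 12 and 13 never were
```

Under this reading, a low value means that much of the language's share of the vocabulary goes unused during generation, which is the kind of signal the abstract associates with weaker translation quality.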
