CL · LG · May 26

Disentangling Language Roles in Multilingual LLM Task Execution

Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He

arXiv:2605.27649 · 78.6 · h-index: 1

Predicted impact: top 70% in CL · last 90 days

Originality: Incremental advance

AI Analysis

For researchers and developers of multilingual LLMs, this work provides a systematic framework to isolate and measure the impact of each language role on task execution, revealing that response-language alignment is critical.

The paper introduces MTM-Bench, a controlled benchmark that disentangles three language roles (instruction, content, response) in multilingual LLM task execution. Evaluating 20 LLMs across 27 language triplets reveals that response-language mismatch is the dominant source of degradation, and mismatch count is not a monotonic predictor of difficulty.

Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\). Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across three task families: semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. Comparing the response-only and full-mismatch conditions suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
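The fully crossed design described above can be illustrated with a minimal sketch. It enumerates every (instruction, content, response) triplet over three languages and groups the triplets by which role is mismatched. The language codes, the function names, and the condition labels other than "response-only" and "full-mismatch" are assumptions made for illustration and do not come from the paper. The per-triplet instance count is simple arithmetic on the stated totals (2,430 / 27); how those instances are divided among task families is not stated in the abstract.

```python
from collections import Counter
from itertools import product

# Hypothetical language codes for English, Spanish, and Chinese.
LANGUAGES = ("en", "es", "zh")


def enumerate_triplets(languages=LANGUAGES):
    """Return every (instruction, content, response) language triplet."""
    return list(product(languages, repeat=3))


def classify(triplet):
    """Label a triplet by which role differs from the others.

    The labels are illustrative assumptions, not the benchmark's own
    taxonomy, apart from "response-only" and "full-mismatch".
    """
    instr, content, resp = triplet
    if instr == content == resp:
        return "aligned"
    if len({instr, content, resp}) == 3:
        return "full-mismatch"
    # Exactly two roles share a language from here on.
    if instr == content:
        return "response-only"
    if content == resp:
        return "instruction-only"
    return "content-only"  # instr == resp, content differs


triplets = enumerate_triplets()
condition_counts = Counter(classify(t) for t in triplets)

# 2,430 instances per model spread over 27 triplets (arithmetic only).
instances_per_triplet = 2430 // len(triplets)

if __name__ == "__main__":
    print(len(triplets), dict(condition_counts), instances_per_triplet)
```

Under this grouping, the response-only and full-mismatch conditions each cover six triplets, so the comparison between them can be made on equally sized cells even though the number of mismatched roles differs.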
