TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models
For developers of multilingual LLMs, TLPO offers a fine-grained solution to language confusion that avoids the capability degradation seen with sequence-level methods.
TLPO is a token-level fine-tuning framework that mitigates language confusion in multilingual LLMs, significantly improving language consistency without degrading general task accuracy.
Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to respond consistently in the intended language, a phenomenon known as language confusion. Prior mitigation approaches rely on sequence-level fine-tuning, such as DPO, ORPO, and GRPO. Because these methods operate on entire responses, they can unintentionally degrade general model capabilities, which motivates more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens at those positions, and updates the policy using a tailored objective that suppresses error-inducing outputs at a granular level. This selective intervention mitigates language confusion without compromising the model's general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.
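The three steps named in the abstract (identify error-prone positions, explore candidate tokens, apply a token-level update) can be illustrated with a toy sketch. The abstract does not specify how positions are detected or what the tailored objective is, so everything below is an assumption made for illustration: the per-token language tags, the top-k candidate exploration, the binary reward, and the mean-centered advantage loss are hypothetical stand-ins, not the TLPO objective itself.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def find_error_positions(token_langs, target_lang):
    """Step 1 (assumed): flag positions whose token is tagged with a
    language other than the target. "neutral" covers punctuation etc."""
    return [i for i, lang in enumerate(token_langs)
            if lang not in (target_lang, "neutral")]

def top_k_candidates(logits, k):
    """Step 2 (assumed): explore the k highest-scoring vocabulary ids."""
    order = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)
    return order[:k]

def token_level_loss(logits, candidates, vocab_langs, target_lang):
    """Step 3 (assumed): reward candidates in the target language with 1,
    others with 0, center the rewards, and weight log-probabilities by the
    resulting advantages. Minimizing this raises in-language candidates
    and suppresses off-language ones at this single position."""
    probs = softmax(logits)
    rewards = [1.0 if vocab_langs[j] in (target_lang, "neutral") else 0.0
               for j in candidates]
    mean_reward = sum(rewards) / len(rewards)
    advantages = [r - mean_reward for r in rewards]
    return -sum(a * math.log(probs[j]) for a, j in zip(advantages, candidates))

# Hypothetical toy vocabulary with per-token language tags.
vocab = ["the", "cat", "chat", "le", "."]
vocab_langs = ["en", "en", "fr", "fr", "neutral"]

# A response meant to be English whose second token drifted into French.
response_langs = ["en", "fr", "en", "neutral"]
error_positions = find_error_positions(response_langs, "en")

# Toy next-token logits at the flagged position.
logits_at_error = [2.0, 0.5, 1.5, 0.0, -1.0]
candidates = top_k_candidates(logits_at_error, k=3)
loss = token_level_loss(logits_at_error, candidates, vocab_langs, "en")
```

Only the flagged position contributes to the loss in this sketch, which is one way to read the claim that localized updates leave the rest of the policy, and hence general capabilities, largely untouched. A real implementation would operate on model logits with automatic differentiation rather than on hand-written lists.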