CL · May 26

Rethinking the Multilingual Reasoning Gap with Layer Swap

Maxence Lasbordes, Amélie Chatelain, Djamé Seddah

arXiv:2605.26735

AI Analysis

For multilingual NLP practitioners, this work provides a simple method to reduce the performance gap between native and English-pivoted reasoning without requiring native-language training data.

The authors show that the native reasoning gap in multilingual LLMs shrinks to 1.9–3.5% under comparable supervision, and propose a Layer Swap method that transfers English reasoning mid-layers to close most of the gap while preserving target-language CoT.

Recent reasoning Large Language Models produce a chain-of-thought (CoT) predominantly in English, even when prompted in non-English languages. Prior work suggests that forcing the CoT to remain in the input language (native reasoning) substantially degrades performance relative to allowing the model to reason in English before answering in the input language (English-pivoted reasoning). However, most studies of this native reasoning gap rely on inference-time interventions or limited native-language training data. We revisit this comparison at a larger scale and under comparable supervision. We construct long multilingual reasoning datasets across six languages (English, French, German, Spanish, Chinese, and Swahili), fine-tune specialists in both native and English-pivoted regimes on top of Qwen/Qwen3-8B-Base, and evaluate across mathematics, science, general knowledge, and code. In this setting, the average native reasoning gap shrinks to 1.9–3.5% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists reveals aligned fine-tuning updates in the middle layers and divergence in the outer layers. This points to a largely language-agnostic reasoning core surrounded by language-specific layers. Exploiting this structure, we introduce Layer Swap, which transfers the English specialist's stronger reasoning mid-layers into each native specialist. This closes most of the native reasoning gap across the five non-English languages while preserving CoT in the target language. We release all models and datasets.
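
The two weight-space ideas in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: checkpoints are stood in for by plain dictionaries of float lists, the parameter naming `model.layers.<i>.` is assumed to follow the usual Hugging Face convention, and the swapped range (`start`, `end`) is a hypothetical choice, since the abstract does not state which layers count as "mid-layers". The helper names `layer_index`, `layer_swap`, and `update_cosine` are invented for illustration.

```python
import math
import re

# Assumed naming convention for transformer block parameters.
LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def layer_index(name):
    """Return the block index encoded in a parameter name, or None."""
    match = LAYER_RE.match(name)
    return int(match.group(1)) if match else None


def layer_swap(native, english, start, end):
    """Copy `native`, taking blocks start..end-1 from `english` instead."""
    merged = {}
    for name in native:
        idx = layer_index(name)
        source = english if idx is not None and start <= idx < end else native
        merged[name] = list(source[name])
    return merged


def update_cosine(base, spec_a, spec_b, name):
    """Cosine similarity between two fine-tuning updates for one parameter."""
    delta_a = [a - b for a, b in zip(spec_a[name], base[name])]
    delta_b = [a - b for a, b in zip(spec_b[name], base[name])]
    dot = sum(x * y for x, y in zip(delta_a, delta_b))
    norm = math.sqrt(sum(x * x for x in delta_a)) * math.sqrt(
        sum(y * y for y in delta_b)
    )
    return dot / norm if norm else 0.0


# Toy 4-block "model": updates agree in blocks 1-2 and diverge elsewhere.
base = {f"model.layers.{i}.mlp.weight": [0.0, 0.0] for i in range(4)}
base["model.embed_tokens.weight"] = [0.0, 0.0]
english = {k: [2.0, 0.0] for k in base}
native = {
    k: [1.0, 0.0] if layer_index(k) in (1, 2) else [0.0, 1.0] for k in base
}

# Hypothetical mid-layer range: blocks 1 and 2.
merged = layer_swap(native, english, start=1, end=3)
mid_alignment = update_cosine(base, english, native, "model.layers.1.mlp.weight")
outer_alignment = update_cosine(base, english, native, "model.layers.0.mlp.weight")
```

In this toy setup the mid-block updates point in the same direction while the outer-block updates are orthogonal, mirroring the aligned-middle, divergent-outer pattern the abstract describes. With real checkpoints the same logic would operate on tensors from each model's state dict rather than on lists.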
