CL · AI · LG · Aug 20, 2025

Long Chain-of-Thought Reasoning Across Languages

arXiv:2508.14828v2 · 12 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses multilingual reasoning gaps in AI systems. It is rated incremental because it builds on existing chain-of-thought methods.

The paper investigates how long chain-of-thought reasoning capabilities transfer from English to nine non-English languages. Scaling model size improves performance when models reason in English, but performance lags when they reason in the target language, especially on multi-step tasks such as mathematical reasoning. For post-training, fine-tuning on reasoning traces translated from English outperforms fine-tuning on target-language traces distilled from large reasoning models.

While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development--scaling, pretraining, post-training, and inference--to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
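The two reasoning settings compared in the abstract can be illustrated with a minimal sketch. The abstract does not give the prompts actually used, so the instruction template, the `ReasoningSetting` class, and the `build_prompt` helper below are all hypothetical names introduced only to show the difference: in En-CoT the input stays in the target language while the reasoning is requested in English, and in Target-CoT both input and reasoning are in the target language.

```python
from dataclasses import dataclass

# Hypothetical instruction template; the actual prompts are not given in the abstract.
INSTRUCTION = "Think step by step in {language} before giving the final answer."


@dataclass(frozen=True)
class ReasoningSetting:
    """One of the two settings named in the abstract."""
    name: str
    reason_in_target: bool  # True: CoT in the target language; False: CoT in English


EN_COT = ReasoningSetting("En-CoT", reason_in_target=False)
TARGET_COT = ReasoningSetting("Target-CoT", reason_in_target=True)


def build_prompt(question: str, target_language: str, setting: ReasoningSetting) -> str:
    """Pair a target-language question with an instruction fixing the reasoning language."""
    reasoning_language = target_language if setting.reason_in_target else "English"
    instruction = INSTRUCTION.format(language=reasoning_language)
    # The question itself is always left in the target language.
    return f"{instruction}\n\n{question}"


if __name__ == "__main__":
    question = "Combien font 17 + 25 ?"  # illustrative French input
    for setting in (EN_COT, TARGET_COT):
        print(f"[{setting.name}]")
        print(build_prompt(question, "French", setting))
        print()
```

Under these assumptions, the only thing that differs between the two settings is the language named in the instruction; the question text is identical in both.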
