D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
This paper addresses efficiency and performance problems in SLM reasoning and offers a practical improvement over existing CoT methods.
The paper tackles "overthinking" in Small Language Models (SLMs), a side effect of Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) that degrades performance and inflates token usage. It proposes D-CoT, a framework that uses control tags to enforce structured reasoning. On Qwen3-8B, D-CoT raises accuracy by 9.9% on GPQA-diamond and by 9.1% on MMLU-Pro (0-shot) while reducing computational cost.
Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration -- as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.
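To make the control-tag idea concrete, here is a minimal sketch of how a tagged reasoning trace might be assembled for training and how a tag-free view of the same trace could be produced. Only the tag names <TEMP_LOW> and <TEMP_HIGH> come from the abstract. The helper names (build_tagged_trace, strip_control_tags), the one-step-per-line layout, and the example steps are hypothetical illustrations, not the authors' data format or pipeline.

```python
import re

# Tag names come from the abstract; the trace layout below is an assumption.
CONTROL_TAGS = ("<TEMP_LOW>", "<TEMP_HIGH>")
_TAG_PATTERN = re.compile("|".join(re.escape(tag) for tag in CONTROL_TAGS))


def build_tagged_trace(steps):
    """Join (tag, text) reasoning steps into one tagged trace, one step per line."""
    parts = []
    for tag, text in steps:
        if tag not in CONTROL_TAGS:
            raise ValueError(f"unknown control tag: {tag}")
        parts.append(f"{tag} {text.strip()}")
    return "\n".join(parts)


def strip_control_tags(trace):
    """Remove control tags and drop empty lines, leaving plain reasoning text."""
    lines = (_TAG_PATTERN.sub("", line).strip() for line in trace.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical example steps: fact-checking, exploration, then verification.
steps = [
    ("<TEMP_LOW>", "Recall the relevant definition and check the given facts."),
    ("<TEMP_HIGH>", "Consider two alternative readings of the question."),
    ("<TEMP_LOW>", "Verify the chosen reading against the facts."),
]
tagged = build_tagged_trace(steps)
plain = strip_control_tags(tagged)
print(tagged)
print(plain)
```

The tagged string stands in for a training target in which each reasoning step is labeled by its mode, and the stripped string shows the same steps without any tags. How the paper actually constructs its 5,000 training samples and evaluates tag-free inference is not specified in the abstract.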