AI · Feb 17, 2025

Small Models Struggle to Learn from Strong Reasoners

UW
arXiv:2502.12143v3 · 97 citations · h-index: 12 · ACL
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of transferring reasoning capabilities to small models for resource-constrained applications. The contribution is incremental, since it builds on existing distillation methods.

The paper tackles the problem that small models (≤3B parameters) do not consistently benefit from long chain-of-thought reasoning or distillation from larger models, and finds they perform better with shorter, simpler reasoning chains. It proposes Mix Distillation, which combines long and short reasoning examples, significantly improving small model reasoning performance compared to training on either data alone.

Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (≤3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.
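
The data-mixing idea behind Mix Distillation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_distillation`, the `long_fraction` parameter and its default value, and the dictionary layout of each example are all assumptions made for illustration, since the abstract does not specify a mixing ratio or data format.

```python
import random


def mix_distillation(long_cot, short_cot, long_fraction=0.2, size=None, seed=0):
    """Build a mixed fine-tuning set from long and short CoT examples.

    long_fraction is a hypothetical knob: the share of the mixed set drawn
    from the long-CoT pool. The value 0.2 is a placeholder, not a figure
    taken from the abstract.
    """
    if not 0.0 <= long_fraction <= 1.0:
        raise ValueError("long_fraction must be in [0, 1]")

    rng = random.Random(seed)
    if size is None:
        # Assumed default: the mixed set is as large as the smaller pool.
        size = min(len(long_cot), len(short_cot))

    n_long = round(size * long_fraction)
    n_short = size - n_long
    if n_long > len(long_cot) or n_short > len(short_cot):
        raise ValueError("not enough examples to build a set of this size")

    # Sample without replacement from each pool, then shuffle so the two
    # kinds of example are interleaved during fine-tuning.
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, n_short)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy pools; real entries would hold a prompt and a reasoning trace.
    long_pool = [{"kind": "long", "id": i} for i in range(10)]
    short_pool = [{"kind": "short", "id": i} for i in range(10)]
    mixed = mix_distillation(long_pool, short_pool, long_fraction=0.2, size=10)
    print(sum(ex["kind"] == "long" for ex in mixed), "long examples of", len(mixed))
```

The same function would cover the second variant mentioned in the abstract (mixing reasoning from larger and smaller teacher models) by passing the two teachers' outputs as the two pools.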

Code Implementations: 1 repo