AI | May 27, 2025

Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning

arXiv:2505.21067v1 | 13 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficiently enhancing reasoning in AI models, but the advance is incremental because it builds on existing distillation and RL methods.

The paper tackles the problem of improving reasoning in large language models. It shows that a simple distillation method using only 920 examples outperforms zero-RL, which requires more data and computational cost. The distilled model reasons more flexibly, as evidenced by its more frequent use of anthropomorphic tokens and logical connectors.

Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method based on the base model can clearly outperform zero-RL, which typically requires much more data and computational cost. By analyzing the token frequency in model outputs, we find that the distilled model shows more flexible reasoning. It uses anthropomorphic tokens and logical connectors much more often than the zero-RL model. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two advanced cognitive behaviors give rise to flexible reasoning, which is essential for solving complex reasoning problems, while zero-RL fails to significantly boost the frequency of these behaviors.
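
The token-frequency comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' analysis code: the marker word lists, the regex tokenizer, and the function names `tokenize` and `marker_rate` are all assumptions made for illustration, and the paper's actual token sets are not given on this page. The sketch only shows one plausible way to measure how often a set of marker tokens appears per 1,000 tokens of model output.

```python
import re
from collections import Counter

# Hypothetical marker sets, chosen for illustration only.
# The paper's actual token lists are not reproduced here.
ANTHROPOMORPHIC = {"wait", "hmm", "okay", "maybe"}
LOGICAL_CONNECTORS = {"but", "however", "therefore", "so", "because"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def marker_rate(outputs, markers):
    """Return marker-token occurrences per 1,000 tokens across all outputs."""
    counts = Counter()
    for text in outputs:
        counts.update(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(counts[m] for m in markers)
    return 1000.0 * hits / total


# Toy usage: compare two small sets of made-up model outputs.
distilled_outputs = ["Wait, maybe the sum is even. But let me check again."]
zero_rl_outputs = ["The sum is even. The answer is four."]
for name, outs in [("distilled", distilled_outputs), ("zero-RL", zero_rl_outputs)]:
    print(name, marker_rate(outs, ANTHROPOMORPHIC), marker_rate(outs, LOGICAL_CONNECTORS))
```

Normalizing by total token count matters here because distilled models often produce longer outputs, so raw counts alone would overstate the difference.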
