AI | CL | May 20, 2025

Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

arXiv:2505.14216v2 | 12 citations | h-index: 3 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of optimizing LLM reasoning for researchers and practitioners, but it is incremental as it builds on known phenomena without introducing new methods.

The paper investigates why reinforcement learning with verifiable rewards (RLVR) improves overall accuracy but not capability in LLM reasoning tasks, while distillation can improve both. It finds that RLVR improves accuracy on easier questions at the expense of the most difficult ones, and it conjectures that distillation boosts capability only when it introduces new knowledge.

Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy (pass@1) but often fails to improve capability (pass@k) of LLMs in reasoning tasks, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR struggles to improve capability because it focuses on improving accuracy on the easier questions to the detriment of accuracy on the most difficult questions. Second, we show that RLVR does not merely increase the success probability for the easier questions but, in our small-model settings, produces quality responses that were absent from the original output distribution. In addition, we show that these responses are neither noticeably longer nor richer in reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, from an experiment distilling teacher responses to in-distribution problems, we find that capability does not always improve with distillation. We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns improves accuracy but not capability, sacrificing performance on the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in LLMs.
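The distinction between accuracy (pass@1) and capability (pass@k) can be made concrete with a small sketch. The abstract does not say which estimator the authors use; the code below assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n samples per question of which c are correct. The helper names and the per-question counts are hypothetical and chosen only to illustrate the trade-off, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over questions, given correct counts out of n samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: correct responses out of n = 100 samples for three
# questions (two easier, one hard), before and after fine-tuning.
N_SAMPLES = 100
base_counts = [50, 50, 5]    # hard question is occasionally solved
tuned_counts = [90, 90, 0]   # easier questions improve, hard one is lost

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    tuned = mean_pass_at_k(tuned_counts, N_SAMPLES, k)
    print(f"pass@{k}: base={base:.3f} tuned={tuned:.3f}")
```

With these made-up counts, mean pass@1 rises after tuning while mean pass@64 falls, because the hard question that the base model occasionally solved is no longer solved at all. This is the pattern the abstract attributes to RLVR, shown here on toy numbers only.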

Code Implementations: 1 repo