AI CL | May 20, 2025

Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

Minwu Kim, Anubhav Shrestha, Safal Shrestha, Aadim Nepal, Keith Ross

arXiv:2505.14216v219.512 citations | h-index: 3 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of optimizing LLM reasoning and is aimed at researchers and practitioners, but it is incremental: it builds on known phenomena without introducing new methods.

The paper investigates why reinforcement learning with verifiable rewards (RLVR) improves overall accuracy but not capability in LLM reasoning tasks, while distillation can improve both. It finds that RLVR improves accuracy on easier questions at the expense of the hardest ones, and it conjectures that distillation boosts capability only when new knowledge is introduced.

Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy (pass@1) but often fails to improve capability (pass@k) of LLMs in reasoning tasks, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR struggles to improve capability as it focuses on improving the accuracy of the easier questions to the detriment of the accuracy of the most difficult questions. Second, we show that RLVR does not merely increase the success probability for the easier questions, but in our small model settings, produces quality responses that were absent in its original output distribution. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, from the experiment distilling teacher responses to in-distribution problems, we find that capability does not always improve with distillation. We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns only improves accuracy but not capability, sacrificing performance on the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in LLMs.
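The abstract contrasts accuracy (pass@1) with capability (pass@k). As a rough illustration of how the two can move in opposite directions, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples per question of which c are correct. The abstract does not say which estimator the authors use, so treating it as this one is an assumption. The per-question counts in `base` and `tuned` are invented toy numbers, not results from the paper, and `mean_pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: samples drawn, c: samples that were correct, k: attempt budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over questions (hypothetical helper for this sketch)."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy, made-up counts of correct samples out of n = 16 per question.
# Questions 0-1 stand in for "easier" items, questions 2-3 for the hardest ones.
n = 16
base = [8, 8, 1, 1]     # assumed base model: occasionally solves the hard questions
tuned = [16, 16, 0, 0]  # assumed tuned model: always solves easy ones, never hard ones

for name, counts in [("base", base), ("tuned", tuned)]:
    print(name, mean_pass_at_k(counts, n, 1), mean_pass_at_k(counts, n, 16))
```

In this toy setting pass@1 rises from 0.28125 to 0.5 while pass@16 falls from 1.0 to 0.5: gains on easier questions lift average accuracy, while losing the rare correct answers on the hardest questions lowers pass@k at large k.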

