LG
Nov 25, 2025

Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

arXiv:2511.19942v2
10 citations
Originality: Highly original
AI Analysis

This paper addresses a critical issue in LLM fine-tuning for researchers and practitioners, offering a principled solution in place of ad hoc heuristics.

The paper tackles the problem of diversity collapse in reinforcement learning fine-tuning of large language models, introducing differential smoothing to provably improve both correctness and diversity, with experiments showing up to 6.7% improvements on the AIME24 dataset.

It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method -- differential smoothing -- that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on the AIME24 dataset.
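
The abstract reports results as Pass@1 and Pass@k but does not say how they are computed. A minimal sketch, assuming the commonly used unbiased estimator (draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k samples is correct); the function name `pass_at_k` and the sample counts are illustrative, not taken from the paper:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may evaluate differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Two hypothetical policies, 8 samples on each of 2 problems.
# "collapsed" always solves problem 1 and never problem 2;
# "diverse" solves each problem half of the time.
collapsed = [8, 0]
diverse = [4, 4]
for k in (1, 4):
    print(k, mean_pass_at_k(collapsed, 8, k), mean_pass_at_k(diverse, 8, k))
```

In this toy setup both policies have the same Pass@1, but only the diverse one gains as k grows, which is why the two metrics are reported together when diversity collapse is the concern.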

