AI CL LG ML
Dec 2, 2024

Self-Improvement in Language Models: The Sharpening Mechanism

Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy

arXiv:2412.01951v234.595 citations
h-index: 15

Originality: Highly original

AI Analysis

This paper provides a foundational framework for designing self-improvement algorithms in language models, addressing a key theoretical gap in AI research.

The paper tackles the problem of understanding why language models can improve through self-evaluation and refinement, proposing a 'sharpening' mechanism where models use themselves as verifiers to focus on high-quality sequences. It establishes theoretical limits, analyzes SFT and RLHF algorithms, and empirically validates that RLHF can outperform SFT by enabling online exploration.

Recent work in language modeling has raised the possibility of self-improvement, where a language model evaluates and refines its own generations to achieve higher performance without external feedback. It is impossible for this self-improvement to create information that is not already in the model, so why should we expect that this will lead to improved capabilities? We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training in order to "sharpen" the model to one placing large mass on high-quality sequences, thereby amortizing the expensive inference-time computation of generating good sequences. We begin by introducing a new statistical framework for sharpening in which the learner aims to sharpen a pre-trained base policy via sample access, and establish fundamental limits. Then we analyze two natural families of self-improvement algorithms based on SFT and RLHF. We find that (i) the SFT-based approach is minimax optimal whenever the initial model has sufficient coverage, but (ii) the RLHF-based approach can improve over SFT-based self-improvement by leveraging online exploration, bypassing the need for coverage. Finally, we empirically validate the sharpening mechanism via inference-time and amortization experiments. We view these findings as a starting point toward a foundational understanding that can guide the design and evaluation of self-improvement algorithms.
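As a rough illustration of the sharpening idea described above, the following toy sketch draws several samples from a base policy, keeps the one the model itself scores highest, and then fits a new policy to those selections. It is a sketch under stated assumptions, not the paper's algorithm: the four-response `BASE_POLICY`, the choice of the model's own log-probability as the self-reward, and the helper names `self_reward`, `best_of_n`, and `sharpen_by_sft` are all hypothetical and introduced only for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical toy base policy: probabilities over four candidate
# responses to a single prompt (illustrative numbers only).
BASE_POLICY = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}


def self_reward(response):
    # Assumed self-reward: the base model's own log-probability of the
    # response. This is one possible verifier, chosen for illustration.
    return math.log(BASE_POLICY[response])


def best_of_n(n, rng):
    # Inference-time sharpening: sample n responses from the base policy
    # and return the one with the highest self-reward.
    responses = list(BASE_POLICY)
    weights = [BASE_POLICY[r] for r in responses]
    candidates = rng.choices(responses, weights=weights, k=n)
    return max(candidates, key=self_reward)


def sharpen_by_sft(n, num_prompts, seed=0):
    # Amortization: fit a categorical policy by maximum likelihood
    # (empirical frequencies) to the best-of-n selections, so a single
    # sample from the new policy mimics the expensive best-of-n procedure.
    rng = random.Random(seed)
    counts = Counter(best_of_n(n, rng) for _ in range(num_prompts))
    return {r: counts[r] / num_prompts for r in BASE_POLICY}


sharpened = sharpen_by_sft(n=8, num_prompts=5000)
unsharpened = sharpen_by_sft(n=1, num_prompts=5000)
print("base mass on A:      ", BASE_POLICY["A"])
print("n=1 fitted mass on A:", unsharpened["A"])
print("n=8 fitted mass on A:", sharpened["A"])
```

In this toy setting the fitted policy with `n=8` concentrates most of its mass on the response the base policy already ranks highest, while `n=1` simply reproduces the base policy up to sampling noise. No new information is created; existing mass is only redistributed toward sequences the model already prefers.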
