LG | AI | CL | Mar 4, 2025

Language Models can Self-Improve at State-Value Estimation for Better Search

arXiv:2503.02878v3 | 6 citations | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the high cost of labeled data in interactive domains such as web tasks and offers a way to make search more efficient with open-source models. The advance is incremental, since it builds on existing value iteration and chain-of-thought concepts.

The paper tackles the problem of expensive ground-truth data for multi-step reasoning tasks by introducing Self-Taught Lookahead (STL), a reward-free framework that improves language model-based value functions through self-supervised reasoning, resulting in a 39% boost in web agent success rates and reduced inference costs.

Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive, particularly in interactive domains such as web tasks. We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model-based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain-of-thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language, predicting the next action, resulting state, and rationale for its value, thereby refining value estimates without any labeled data. This self-supervised procedure yields more accurate state-value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL-trained value models built on moderately sized (8B parameter) open-weight LLMs boost web agent success rates by 39%, achieving performance comparable to that of proprietary models. STL also generalizes to multi-hop QA and math puzzles. We find that STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.
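
The value-iteration analogy in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a dictionary stands in for the value LLM, overwriting its entries stands in for fine-tuning, and the environment (a number line with a goal at state 5), the discount factor GAMMA, and every function name are hypothetical choices made for illustration only. The sketch shows one-step lookahead producing a training target (action, next state, rationale, value) from the model's own estimates, and a greedy search that benefits from the refined values.

```python
from dataclasses import dataclass

GOAL = 5            # hypothetical toy task: reach state 5 on a number line
ACTIONS = (-1, +1)  # move left or right
GAMMA = 0.9         # assumed discount factor, not taken from the paper


def step(state, action):
    """Deterministic toy transition, clipped to the range [0, GOAL]."""
    return max(0, min(GOAL, state + action))


def initial_values():
    """Stand-in for an untrained value model: it only recognises the goal."""
    return {s: (1.0 if s == GOAL else 0.0) for s in range(GOAL + 1)}


@dataclass
class LookaheadExample:
    state: int
    action: int
    next_state: int
    value: float
    rationale: str


def lookahead(state, value_fn):
    """One step of lookahead using only the model's own successor estimates."""
    best = None
    for action in ACTIONS:
        nxt = step(state, action)
        backed_up = GAMMA * value_fn[nxt]
        if best is None or backed_up > best[2]:
            best = (action, nxt, backed_up)
    action, nxt, value = best
    rationale = (
        f"Taking action {action:+d} from state {state} leads to state {nxt}, "
        f"estimated at {value_fn[nxt]:.2f}, so this state is worth about {value:.2f}."
    )
    return LookaheadExample(state, action, nxt, value, rationale)


def self_improve(value_fn, rounds):
    """Repeatedly build lookahead targets and 'train' on them (table overwrite)."""
    for _ in range(rounds):
        examples = [lookahead(s, value_fn) for s in range(GOAL)]  # non-goal states
        value_fn = {**value_fn, **{ex.state: ex.value for ex in examples}}
    return value_fn


def greedy_search(value_fn, start=0, max_steps=20):
    """Lightweight search: always move to the successor with the highest value."""
    path = [start]
    state = start
    while state != GOAL and len(path) <= max_steps:
        state = max((step(state, a) for a in ACTIONS), key=lambda s: value_fn[s])
        path.append(state)
    return path


if __name__ == "__main__":
    trained = self_improve(initial_values(), rounds=5)
    print(lookahead(4, initial_values()).rationale)
    print(greedy_search(trained))
```

In this toy setting no external reward is consulted: each target is derived from the current estimates of successor states, which is the sense in which the procedure resembles a value-iteration backup. In STL as described above, the target would be a natural-language lookahead generated and learned by a value LLM rather than a table entry.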

Code Implementations: 1 repo