LG · AI · CL · Feb 9, 2024

V-STaR: Training Verifiers for Self-Taught Reasoners

DeepMind
arXiv:2402.06457v2 · 230 citations · h-index: 35
AI Analysis

This paper addresses a bottleneck in LLM self-improvement, namely that incorrect self-generated solutions are thrown away, and offers a more efficient way to improve performance on reasoning and code generation tasks. The contribution is incremental, since it builds on existing self-improvement frameworks.

Self-improvement methods for large language models typically discard the incorrect solutions they generate. The paper proposes V-STaR, which trains a verifier on both correct and incorrect solutions and uses it at inference to select the best candidate. This yields a 4% to 17% test accuracy improvement over existing approaches on code generation and math reasoning benchmarks.

Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
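
The abstract names three moving parts: collecting correct and incorrect solutions, training a verifier with DPO, and picking one candidate at inference. The sketch below illustrates them in plain Python and is not the authors' implementation. All function names are hypothetical. Pairing every correct solution with every incorrect one for the same problem is an assumption about how the preference data could be built. The `beta` value is illustrative, and `score` stands in for whatever scalar the trained verifier produces. The loss is the standard DPO objective for a single preference pair.

```python
import math
from itertools import product


def build_preference_pairs(solutions):
    """Turn labeled samples into (problem, preferred, dispreferred) triples.

    `solutions` maps a problem to a list of (solution_text, is_correct).
    Assumption: every correct solution is paired with every incorrect one.
    """
    pairs = []
    for problem, candidates in solutions.items():
        correct = [s for s, ok in candidates if ok]
        incorrect = [s for s, ok in candidates if not ok]
        for good, bad in product(correct, incorrect):
            pairs.append((problem, good, bad))
    return pairs


def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    Inputs are sequence log-probabilities under the model being trained
    and under a frozen reference model. `beta` here is illustrative.
    """
    margin = beta * ((logp_good - ref_logp_good) - (logp_bad - ref_logp_bad))
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def select_best(candidates, score):
    """Return the candidate with the highest verifier score."""
    return max(candidates, key=score)


if __name__ == "__main__":
    data = {
        "2 + 2 = ?": [("4", True), ("5", False), ("22", False)],
        "3 * 3 = ?": [("9", True), ("6", False)],
    }
    pairs = build_preference_pairs(data)
    print(len(pairs), "preference pairs")

    # Toy verifier scores, hard-coded purely for illustration.
    toy_scores = {"4": 0.9, "5": 0.2, "22": 0.1}
    print(select_best(["5", "4", "22"], toy_scores.get))
```

Problems whose samples are all correct or all incorrect contribute no pairs under this pairing scheme, which is one reason such a pipeline samples many solutions per problem.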

