AI CL | Jan 31, 2025

SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling

Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, Jinsung Yoon, Sercan Ö Arık

arXiv:2501.19306v432.255 citations | h-index: 43 | Trans. Mach. Learn. Res.

Originality: Highly original

AI Analysis

This work addresses inefficient test-time scaling for researchers and practitioners who apply LLMs to complex tasks. Its approach is new in how it combines techniques, while building incrementally on existing methods.

The paper tackles the inefficiency and saturation issues in test-time computation for large language models by proposing SETS, a method that combines parallel and sequential techniques with self-verification and self-correction, achieving significant performance improvements on complex reasoning tasks without model training.

Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel methods like repeated sampling are often inefficient and quickly saturate, while sequential methods like SELF-REFINE struggle to improve after a few rounds. Although combining these approaches shows promise, current methods require fine-tuned reward and revision models. This paper proposes Self-Enhanced Test-Time Scaling (SETS), a simple yet effective approach that overcomes these limitations by strategically combining parallel and sequential techniques and fully leveraging LLMs' self-improvement abilities. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This facilitates efficient and scalable test-time computation for enhanced performance on complex tasks without any model training. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
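The abstract does not spell out the control flow, so the following is only a minimal sketch of one way to combine parallel sampling with sequential self-verification and self-correction. The function name `sets_sketch`, the callables `sample`, `verify`, and `correct`, the parameters `num_samples` and `max_rounds`, and the final majority vote are all assumptions made for illustration, not details taken from the paper. In practice each callable would wrap a prompt to the same LLM; here they are toy stand-ins.

```python
from collections import Counter
from typing import Callable


def sets_sketch(
    sample: Callable[[str], str],
    verify: Callable[[str, str], bool],
    correct: Callable[[str, str], str],
    query: str,
    num_samples: int = 4,
    max_rounds: int = 2,
) -> str:
    """Sample several answers, refine each with verify/correct rounds, then vote."""
    finals = []
    for _ in range(num_samples):              # parallel axis (run serially here)
        answer = sample(query)
        for _ in range(max_rounds):           # sequential axis
            if verify(query, answer):         # the model judges its own answer
                break
            answer = correct(query, answer)   # the model revises its own answer
        finals.append(answer)
    # Assumption of this sketch: aggregate the refined answers by majority vote.
    return Counter(finals).most_common(1)[0][0]


# Toy stand-ins for the LLM calls (hypothetical, deterministic for the demo).
drafts = iter(["41", "42", "39", "42"])
toy_sample = lambda q: next(drafts)
toy_verify = lambda q, a: a == "42"
toy_correct = lambda q, a: str(int(a) + 1)

print(sets_sketch(toy_sample, toy_verify, toy_correct, "6 * 7 = ?"))  # prints 42
```

In this toy run the first draft is corrected to "42" within the round budget, the third draft only reaches "41", and the vote still selects "42". The two loop bounds are the knobs that trade parallel against sequential test-time compute.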
