Sample Complexity and Representation Ability of Test-time Scaling Paradigms
This work provides foundational theoretical insights into sample efficiency and representation ability for test-time strategies in LLMs, addressing a key bottleneck in AI research.
The paper tackled the theoretical understanding of test-time scaling paradigms for large language models, establishing that self-consistency requires Θ(1/Δ²) samples and best-of-n requires Θ(1/Δ) for correct answers, and showing that self-correction enables Transformers to simulate online learning for multi-task settings.
Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $Θ(1/Δ^2)$ samples to produce the correct answer, while best-of-$n$ only needs $Θ(1/Δ)$, where $Δ< 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.