MarkovScale: Towards Optimal Sequential Scaling at Inference Time

Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei, Qing Li

arXiv:2602.01120v11.4

Originality: Highly original

AI Analysis

This work addresses the challenge of achieving optimal and resource-efficient inference in large language models, representing a significant but incremental improvement over existing scaling methods.

The paper tackles the suboptimal performance of sequential scaling in LLM inference by modeling it as a two-state Markov process. The model yields the conditions under which accuracy improves, along with theoretical performance bounds. The resulting system, MarkovScale, is reported to outperform state-of-the-art methods across multiple benchmarks and configurations.

Sequential scaling is a prominent inference-time scaling paradigm, yet its performance improvements are typically modest and not well understood, largely due to the prevalence of heuristic, non-principled approaches that obscure clear optimality bounds. To address this, we propose a principled framework that models sequential scaling as a two-state Markov process. This approach reveals the underlying properties of sequential scaling and yields closed-form solutions for essential aspects, such as the specific conditions under which accuracy is improved and the theoretical upper, neutral, and lower performance bounds. Leveraging this formulation, we develop MarkovScale, a practical system that applies these optimality criteria to achieve a theoretically grounded balance between accuracy and efficiency. Comprehensive experiments across 3 backbone LLMs, 5 benchmarks, and over 20 configurations show that MarkovScale consistently outperforms state-of-the-art parallel and sequential scaling methods, representing a significant step toward optimal and resource-efficient inference in LLMs. The source code will be released upon acceptance at https://open-upon-acceptance.
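To make the two-state Markov view concrete, the following is a minimal sketch of a generic two-state chain over "correct" and "incorrect" answers. It is not taken from the paper: the parameter names `p_fix` (probability that a revision round turns an incorrect answer into a correct one) and `p_break` (probability that it turns a correct answer into an incorrect one) are assumptions made for illustration, and the closed form shown is the textbook result for a time-homogeneous two-state chain, which may differ from the paper's own formulation.

```python
def stationary_accuracy(p_fix: float, p_break: float) -> float:
    """Long-run probability of being in the 'correct' state.

    Assumes p_fix + p_break > 0 (hypothetical parameters, not from the paper).
    """
    return p_fix / (p_fix + p_break)


def accuracy_closed_form(a0: float, p_fix: float, p_break: float, rounds: int) -> float:
    """Accuracy after `rounds` revisions via the standard closed form:
    a_t = pi + (a0 - pi) * (1 - p_fix - p_break) ** t
    """
    pi = stationary_accuracy(p_fix, p_break)
    return pi + (a0 - pi) * (1.0 - p_fix - p_break) ** rounds


def accuracy_by_iteration(a0: float, p_fix: float, p_break: float, rounds: int) -> float:
    """Same quantity, computed by applying the transition step by step."""
    a = a0
    for _ in range(rounds):
        # Correct answers survive with prob (1 - p_break);
        # incorrect answers are repaired with prob p_fix.
        a = a * (1.0 - p_break) + (1.0 - a) * p_fix
    return a


def improves(a0: float, p_fix: float, p_break: float) -> bool:
    """In this toy model, one more round raises accuracy exactly when the
    current accuracy is below the stationary value."""
    return a0 < stationary_accuracy(p_fix, p_break)


if __name__ == "__main__":
    a0, p_fix, p_break = 0.6, 0.3, 0.1  # illustrative values only
    for t in range(5):
        print(t, round(accuracy_closed_form(a0, p_fix, p_break, t), 4))
```

In this toy setting, accuracy moves toward `p_fix / (p_fix + p_break)` as rounds accumulate, so additional sequential rounds help only while the current accuracy sits below that value. This is one simple way to read the abstract's distinction between improving, neutral, and degrading regimes.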
