PFAI | Sep 23, 2025

Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling

arXiv:2509.19645v1 | 1 citation | h-index: 3
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for more realistic scaling evaluations of large language models, which matters to researchers and practitioners deploying models in real-world systems. The contribution is incremental, since it builds on existing scaling concepts.

The paper argues that current test-time scaling methods optimize for compute-optimal metrics while ignoring practical system factors such as latency and cost-per-token, and finds that these methods show limitations when evaluated holistically.

Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto-frontier, ignoring the simple fact that compute-optimal is not always system-optimal. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics, such as latency and cost-per-token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.
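The claim that compute-optimal is not always system-optimal can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three configurations, their accuracy, FLOPs, and latency values, and the `pareto_front` helper are hypothetical and are not taken from the paper. The sketch only shows that the Pareto frontier over (compute, accuracy) can differ from the frontier over (latency, accuracy).

```python
from typing import Callable, List, NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float   # higher is better
    flops: float      # lower is better (hypothetical relative units)
    latency_s: float  # lower is better (hypothetical seconds)


def pareto_front(configs: List[Config], cost: Callable[[Config], float]) -> List[str]:
    """Return names of configs that no other config dominates on (cost, accuracy)."""
    front = []
    for c in configs:
        dominated = any(
            cost(o) <= cost(c)
            and o.accuracy >= c.accuracy
            and (cost(o) < cost(c) or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            front.append(c.name)
    return front


# Hypothetical numbers, chosen only to make the two frontiers differ.
CONFIGS = [
    Config("small model, many samples", accuracy=0.70, flops=1.0, latency_s=8.0),
    Config("large model, few samples", accuracy=0.70, flops=2.0, latency_s=3.0),
    Config("large model, many samples", accuracy=0.80, flops=6.0, latency_s=9.0),
]

compute_front = pareto_front(CONFIGS, lambda c: c.flops)
latency_front = pareto_front(CONFIGS, lambda c: c.latency_s)

print("compute-optimal frontier:", compute_front)
print("latency-optimal frontier:", latency_front)
```

With these made-up values, the small model with many samples sits on the compute frontier but is dominated once latency is the cost axis, while the large model with few samples shows the opposite pattern. A system-aware evaluation in the abstract's sense would swap in measured latency or cost-per-token for the cost function.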

