Limits and Gains of Test-Time Scaling in Vision-Language Reasoning
This work addresses how to optimize inference-time computation for vision-language reasoning and offers insights for researchers and practitioners. Its contribution is incremental: it extends existing test-time scaling (TTS) methods to multimodal systems without introducing a new paradigm.
The study systematically evaluated test-time scaling (TTS) for vision-language models. Closed-source models benefited from structured reasoning and self-refinement, whereas open-source models showed inconsistent gains. Effectiveness also varied by dataset: improvements were clear on multi-step reasoning tasks but limited on perception-focused benchmarks.
Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied to both open-source and closed-source VLMs across multiple benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative self-refinement, open-source VLMs behave inconsistently: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.
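To make the two contrasted strategies concrete, the following is a minimal Python sketch of external verification (in a best-of-N form) and iterative self-refinement. It is an illustration under stated assumptions, not the implementation evaluated in the study: the callables `generate`, `score`, and `revise` are hypothetical stand-ins for a VLM sampler, an external verifier or reward model, and a critique-and-revise step, and the string `prompt` stands in for an image-plus-question input.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions for illustration, not from the paper):
#   generate(prompt)        -> one candidate answer
#   score(prompt, answer)   -> float from an external verifier, higher is better
#   revise(prompt, answer)  -> answer rewritten by the model after self-critique
Generate = Callable[[str], str]
Score = Callable[[str, str], float]
Revise = Callable[[str, str], str]


def best_of_n(prompt: str, generate: Generate, score: Score, n: int = 4) -> str:
    """External verification: sample n candidates, keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))


def self_refine(prompt: str, generate: Generate, revise: Revise, rounds: int = 2) -> str:
    """Self-refinement: generate once, then let the model revise its own answer."""
    answer = generate(prompt)
    for _ in range(rounds):
        answer = revise(prompt, answer)
    return answer
```

In this sketch, extra inference compute is spent either on additional samples (`n`) or on additional revision passes (`rounds`). The quality of `best_of_n` depends on the external `score` function, while `self_refine` depends entirely on the model's own ability to critique its output.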