LG · Dec 11, 2025

Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

Mohammadjavad Ahmadpour, Amirmahdi Meighani, Payam Taebi, Omid Ghahroodi, Amirmohammad Izadi, Mahdieh Soleymani Baghshah

arXiv:2512.11109v1 · 1 citation · Has Code

Originality: Synthesis-oriented

AI Analysis

This work studies how to allocate inference-time computation for vision-language reasoning and offers insights for researchers and practitioners. It is incremental: it extends existing test-time scaling (TTS) methods to multimodal systems without introducing a new paradigm.

The study systematically evaluates test-time scaling (TTS) for vision-language models. Closed-source models benefit from structured reasoning and self-refinement, while open-source models show inconsistent gains. Effectiveness also varies by dataset: improvements are clear on multi-step reasoning tasks but limited on perception-focused benchmarks.

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied to both open-source and closed-source VLMs on different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.
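To make the two families of methods named in the abstract concrete, the following is a minimal sketch of external verification (written here as best-of-N selection) and iterative self-refinement. It is not the authors' implementation: `best_of_n`, `self_refine`, and the `generate`, `score`, and `refine` callables are hypothetical names, and the `question` string stands in for a full image-plus-text prompt to a VLM.

```python
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    question: str,
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate answers and keep the one the verifier scores highest.

    `generate` and `score` are hypothetical stand-ins for a VLM call and an
    external verifier (for example, a reward model).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(question) for _ in range(n)]
    scored = [(score(question, answer), answer) for answer in candidates]
    # max() returns the first highest-scoring pair, so ties keep the earlier sample.
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


def self_refine(
    generate: Callable[[str], str],
    refine: Callable[[str, str], str],
    question: str,
    rounds: int = 2,
) -> List[str]:
    """Produce an initial answer, then revise it `rounds` times.

    `refine` is a hypothetical stand-in for prompting the same model to
    critique and rewrite its previous draft. All drafts are returned so the
    caller can check whether later rounds helped or hurt.
    """
    drafts = [generate(question)]
    for _ in range(rounds):
        drafts.append(refine(question, drafts[-1]))
    return drafts
```

Both routines spend extra inference-time compute, but in different ways: best-of-N relies on a separate scorer to choose among independent samples, whereas self-refinement relies on the model's own ability to improve a draft. Returning every draft from `self_refine` makes it easy to measure the kind of degradation the abstract reports for open-source models.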

