LG | Jun 20, 2025

Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Hanchao Yu, Minjia Zhang, Klara Nahrstedt

arXiv:2506.17417v2 | 3 citations | h-index: 6

Originality: Incremental advance

AI Analysis

This work examines whether existing inference-time techniques improve VLM performance. Its contribution is incremental: it shows that these techniques are less effective for VLMs because the models verify their own answers poorly.

The paper investigates whether inference-time scaling methods, which improve reasoning in large language models, similarly benefit vision-language models (VLMs). It finds that majority vote outperforms verification-centric strategies and that RL-trained VLMs show weak self-verification, which limits performance gains.

Inference-time techniques such as decoding-time scaling and self-refinement have been shown to substantially improve reasoning in large language models (LLMs), driven by emergent self-correction and self-verification behaviors often elicited through reinforcement learning (RL). In this work, we investigate whether these inference-time scaling methods similarly benefit vision-language models (VLMs), especially those fine-tuned with RL. Through extensive evaluation, we find that while strategies like majority vote and best-of-N with self-verification enhance VLM performance, majority vote significantly outperforms verification-centric ones. Furthermore, inference-time scaling behaviors commonly associated with RL-tuned models, such as the 'A-ha moment,' do not yield consistent performance gains. Our analysis identifies a key limitation: current RL-trained VLMs exhibit weak self-verification across both visual and textual modalities, limiting the effectiveness of inference-time scaling.
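The two selection strategies named in the abstract can be contrasted with a minimal sketch. This is not the authors' implementation: the function names, the candidate answers, and the verifier scores below are all hypothetical and chosen only to illustrate how a poorly calibrated verifier can override an answer that most samples agree on.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], verifier_score: Callable[[str], float]) -> str:
    """Return the candidate that the verifier scores highest."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    return max(answers, key=verifier_score)


# Hypothetical final answers from N = 5 sampled responses to one question.
sampled_answers = ["12", "15", "12", "9", "12"]

# Hypothetical self-verification scores (illustrative, not measured data).
# The verifier is deliberately miscalibrated: it prefers a minority answer.
weak_verifier_scores = {"12": 0.4, "15": 0.9, "9": 0.1}

vote_choice = majority_vote(sampled_answers)
verified_choice = best_of_n(sampled_answers, lambda a: weak_verifier_scores[a])

print("majority vote:", vote_choice)
print("best-of-N with verifier:", verified_choice)
```

In this toy setup the two strategies disagree: majority vote relies only on agreement among samples, while best-of-N is only as reliable as the scores the verifier assigns.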
