Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
This work, aimed at AI researchers, addresses the problem of improving VLM performance at inference time. It offers incremental insight by showing that existing inference-time techniques are less effective for VLMs because of weak verification.
The paper investigates whether inference-time scaling methods, which improve reasoning in large language models, similarly benefit vision-language models (VLMs). It finds that majority vote outperforms verification-centric strategies and that RL-trained VLMs show weak self-verification, which limits performance gains.
Inference-time techniques such as decoding-time scaling and self-refinement have been shown to substantially improve reasoning in large language models (LLMs), driven by emergent self-correction and self-verification behaviors often elicited through reinforcement learning (RL). In this work, we investigate whether these inference-time scaling methods similarly benefit vision-language models (VLMs), especially those fine-tuned with RL. Through extensive evaluation, we find that while strategies like majority vote and best-of-N with self-verification both enhance VLM performance, majority vote significantly outperforms the verification-centric strategies. Furthermore, inference-time scaling behaviors commonly associated with RL-tuned models, such as the "aha moment," do not yield consistent performance gains. Our analysis identifies a key limitation: current RL-trained VLMs exhibit weak self-verification across both visual and textual modalities, which limits the effectiveness of inference-time scaling.
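To make the two families of strategies compared above concrete, the following is a minimal sketch of majority vote and best-of-N selection over N sampled answers. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `verifier_score` callable (standing in for a model's self-verification score), and the toy data are all hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the samples.

    Ties are broken in favor of the answer that was seen first.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str],
              verifier_score: Callable[[str], float]) -> str:
    """Return the answer with the highest verifier score.

    `verifier_score` is a hypothetical stand-in for self-verification:
    in practice the model would be asked to judge its own answer.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    return max(answers, key=verifier_score)


# Toy data (assumed, not from the paper): five sampled answers to one question.
samples = ["12", "15", "12", "12", "15"]

# A deliberately unreliable toy verifier that prefers the minority answer.
toy_scores = {"12": 0.4, "15": 0.9}

voted = majority_vote(samples)
selected = best_of_n(samples, lambda a: toy_scores[a])
```

In this toy setup, majority vote returns the most common answer, while best-of-N follows the verifier and returns whichever answer it scores highest. If the verifier is unreliable, best-of-N can select a minority answer that voting would have rejected, which is one way weak self-verification could limit verification-centric strategies.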