Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference
This work addresses the poor adaptability of VLM agents in dynamic digital environments like the web, offering a novel approach that improves the agent at inference time.
The paper tackles the problem of vision-language model agents struggling with fast-changing environments by introducing a method that enhances agent policies at inference without retraining. On the WebVoyager benchmark, success rates improve from 38.8% to 55.7% for a Qwen2.5-VL-7B agent and from 82.4% to 88.8% for a GPT-4.1 agent.
Vision-Language Models (VLMs) have become powerful backbones for agents that autonomously operate in digital environments such as the web and operating systems. However, these models adapt poorly to fast-changing environments like the web. Fine-tuning can alleviate this, but it requires expensive model training and data collection. In this work, we introduce a novel paradigm for enhancing agentic VLM policies at inference without policy retraining. Fundamentally, our approach decouples the VLM's role as a high-capacity action proposer from the final action selection mechanism. We keep the VLM policy frozen and use it to generate a set of candidate actions for a given state. Then, a lightweight, offline-trained Q-function reranks these candidates, and the agent executes the action with the highest estimated value. The main contribution is to apply the Q-function directly during inference for immediate policy improvement, rather than offline to relabel data for policy retraining. We demonstrate on the academic WebVoyager benchmark that our method significantly boosts agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.
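The selection step described above (a frozen policy proposes candidates, a Q-function scores them, the agent executes the top-scoring one) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the paper: the `propose` and `q_value` callables stand in for the frozen VLM and the offline-trained Q-function, states and actions are plain strings, duplicate candidates are dropped before scoring, and the toy proposer and scores are invented.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a real agent would use screenshots and structured actions.
State = str
Action = str


def select_action(
    state: State,
    propose: Callable[[State, int], Sequence[Action]],
    q_value: Callable[[State, Action], float],
    num_candidates: int = 4,
) -> Tuple[Action, List[Tuple[float, Action]]]:
    """Return the candidate with the highest estimated value, plus all scores."""
    candidates = propose(state, num_candidates)
    if not candidates:
        raise ValueError("the proposer returned no candidate actions")
    # Drop duplicate proposals while keeping their original order (an assumption).
    unique = list(dict.fromkeys(candidates))
    scored = [(q_value(state, action), action) for action in unique]
    # max() keeps the first candidate on ties, so proposal order breaks ties.
    _, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, scored


# Toy stand-in for the frozen VLM policy (illustrative only).
def toy_propose(state: State, n: int) -> Sequence[Action]:
    pool = ["click[home_link]", "type[search_box]", "scroll[down]", "click[home_link]"]
    return pool[:n]


# Toy stand-in for the offline-trained Q-function (illustrative only).
def toy_q_value(state: State, action: Action) -> float:
    scores = {"click[home_link]": 0.1, "type[search_box]": 0.8, "scroll[down]": 0.3}
    return scores.get(action, 0.0)


if __name__ == "__main__":
    best, scored = select_action("search page", toy_propose, toy_q_value)
    print("chosen action:", best)
    for score, action in scored:
        print(f"  {score:.2f}  {action}")
```

In this sketch the proposer is never updated. Only the scoring function decides which candidate runs, which mirrors the decoupling of action proposal from action selection described in the abstract.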