VisualActBench: Can VLMs See and Act like a Human?
This paper addresses the challenge of developing vision-centric AI agents that can act like humans in real-world scenarios. Its contribution is incremental in that it provides a benchmark rather than a new method.
The paper tackles the problem that Vision-Language Models (VLMs) lack proactive reasoning and action based on visual inputs alone. It introduces the Visual Action Reasoning task and the VisualActBench benchmark, which comprises 1,074 videos and 3,733 actions. It finds that frontier models like GPT4o still show a significant gap relative to human-level reasoning, especially in generating proactive, high-priority actions.
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
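To make the annotation scheme described above concrete, the following is a minimal sketch of how one labeled action might be represented and filtered. The field names, the integer APL scale, the assumption that a higher APL means higher priority, and the sample records are all hypothetical illustrations; they are not taken from the benchmark's actual data format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionAnnotation:
    """One human-annotated action (hypothetical schema, not the benchmark's own)."""
    video_id: str
    description: str
    apl: int          # Action Prioritization Level; an integer scale is an assumption
    action_type: str  # "proactive" or "reactive"


def count_by_type(annotations):
    """Count annotations per proactive/reactive type."""
    return Counter(a.action_type for a in annotations)


def high_priority(annotations, min_apl):
    """Keep annotations with APL >= min_apl (assumes higher APL = higher priority)."""
    return [a for a in annotations if a.apl >= min_apl]


# Made-up records for illustration only; not drawn from VisualActBench.
sample = [
    ActionAnnotation("vid_001", "Move the kettle away from the table edge", 3, "proactive"),
    ActionAnnotation("vid_001", "Wipe up the spilled water", 2, "reactive"),
    ActionAnnotation("vid_002", "Close the open gate before the dog reaches it", 4, "proactive"),
]

type_counts = count_by_type(sample)
urgent = high_priority(sample, min_apl=3)
```

Grouping by type and thresholding on APL mirrors the abstract's emphasis on proactive, high-priority actions, where the gap to human-level reasoning is reported to be largest.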