CV · Feb 24

From Perception to Action: An Interactive Benchmark for Vision Reasoning

arXiv:2602.21015v1 · 2 citations · h-index: 77
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in evaluating embodied agents and interactive applications, though it is incremental as it focuses on benchmarking rather than proposing new methods.

The authors evaluate whether vision-language models can reason about physical constraints in dynamic environments by introducing the CHAIN benchmark. Their results show that top-performing models struggle to internalize physical structure and causal constraints, and often fail at long-horizon planning and action execution.

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

