RO, CV | Nov 25, 2025

ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation

arXiv:2511.20330v2
Originality: Incremental advance
AI Analysis

This paper addresses interactive articulated manipulation for robotics and AI systems, offering incremental improvements in benchmarking and framework design.

The paper tackles the problem of generalizing vision-language policies for articulated object manipulation by introducing ArtiBench, a five-level benchmark, and ArtiBrain, a modular framework that outperforms state-of-the-art methods in robustness and generalization.

Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. We first introduce ArtiBench, a five-level benchmark covering kitchen, storage, office, and tool environments. ArtiBench enables structured evaluation from cross-part and cross-instance variation to long-horizon multi-object tasks, revealing the core generalization challenges of articulated object manipulation. Building on this benchmark, we propose ArtiBrain, a modular framework that unifies high-level reasoning with adaptive low-level control. ArtiBrain uses a VLM-based Task Reasoner (GPT-4.1) to decompose and validate subgoals, and employs a Hybrid Controller that combines geometry-aware keyframe execution with affordance-guided diffusion for precise and interpretable manipulation. An Affordance Memory Bank continually accumulates successful execution episodes and propagates part-level actionable affordances to unseen articulated parts and configurations. Extensive experiments on ArtiBench show that our ArtiBrain significantly outperforms state-of-the-art multimodal and diffusion-based methods in robustness and generalization. Code and dataset will be released upon acceptance.
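The abstract describes an Affordance Memory Bank that stores successful episodes and reuses them for unseen parts, feeding a Hybrid Controller with a keyframe branch and a diffusion branch. The sketch below is one hypothetical reading of that idea, not the authors' code (which is unreleased). Every name in it (`Episode`, `AffordanceMemoryBank`, `choose_controller`), the cosine-similarity retrieval, the 0.8 threshold, and the rule for switching between branches are assumptions made for illustration; the abstract does not state how retrieval or switching actually works.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Episode:
    """One successful interaction (hypothetical record layout)."""
    part_label: str                            # e.g. "drawer_handle"
    feature: tuple[float, ...]                 # stand-in for a learned part embedding
    contact_point: tuple[float, float, float]  # where the gripper made contact


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class AffordanceMemoryBank:
    """Accumulates successful episodes and retrieves the most similar one."""
    episodes: list[Episode] = field(default_factory=list)

    def add_success(self, episode):
        self.episodes.append(episode)

    def retrieve(self, feature, min_similarity=0.8):
        # Nearest stored episode by cosine similarity (assumed retrieval rule).
        best = max(self.episodes,
                   key=lambda e: cosine(e.feature, feature),
                   default=None)
        if best is None or cosine(best.feature, feature) < min_similarity:
            return None
        return best


def choose_controller(bank, feature):
    """Pick a low-level branch for a part (assumed switching rule).

    With no sufficiently similar memory, fall back to the geometry-based
    keyframe branch; otherwise pass the retrieved episode as an affordance
    hint to the diffusion branch.
    """
    hint = bank.retrieve(feature)
    if hint is None:
        return "keyframe", None
    return "diffusion", hint
```

In this toy version, an empty bank always yields the keyframe branch. After `add_success` records a drawer-handle episode, a query with a nearby feature vector returns the diffusion branch together with that episode as the hint.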
