Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video
For computer vision researchers, this work provides a more robust and generalizable solution to articulated object understanding from casual videos, reducing reliance on complex video setups.
The paper tackles the estimation of 3D kinematics of articulated objects from a single casual video, proposing a category-agnostic optimization framework that treats the problem as primitive fitting. The authors report that the method outperforms existing approaches on two new benchmarks featuring heavy occlusions and camera motion.
Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or on cues such as long-term point tracking or wide-baseline matching, and are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles the partial observations and occlusions inherent to real-world data. We also introduce the AiP-synth and AiP-real benchmarks, which feature significant camera motion and heavy occlusions, and show that our method outperforms existing methods on them. Project page: https://aartykov.github.io/Articulation-in-Prime/
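
To make the two joint types named in the abstract concrete, the following is a minimal sketch of how a revolute joint (rotation about an axis through a pivot) and a prismatic joint (translation along an axis) move a 3D point on a part. It is an illustration only, not the paper's implementation: the abstract does not specify how joints are parameterized, so the function names `revolute` and `prismatic`, the axis-plus-pivot parameterization, and the use of Rodrigues' rotation formula are all assumptions.

```python
import math


def _normalize(v):
    """Return v scaled to unit length; reject a zero-length axis."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("joint axis must be non-zero")
    return tuple(c / n for c in v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def revolute(point, axis, pivot, angle):
    """Rotate `point` by `angle` radians about the line through `pivot`
    with direction `axis` (hypothetical parameterization, Rodrigues' formula)."""
    k = _normalize(axis)
    v = tuple(p - o for p, o in zip(point, pivot))  # point relative to pivot
    c, s = math.cos(angle), math.sin(angle)
    k_cross_v = _cross(k, v)
    k_dot_v = _dot(k, v)
    rotated = tuple(
        v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c)
        for i in range(3)
    )
    return tuple(r + o for r, o in zip(rotated, pivot))


def prismatic(point, axis, displacement):
    """Translate `point` by `displacement` along the unit direction of `axis`
    (hypothetical parameterization)."""
    k = _normalize(axis)
    return tuple(p + displacement * a for p, a in zip(point, k))


# A door-like part hinged on a vertical axis through (1, 0, 0):
# rotating by 90 degrees gives approximately (1.0, 1.0, 0.0).
door_corner = revolute((2.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), math.pi / 2)

# A drawer-like part sliding 0.3 units along the y direction.
drawer_front = prismatic((0.0, 0.0, 0.0), (0.0, 2.0, 0.0), 0.3)
```

In a fitting setting such as the one the abstract describes, the axis, pivot, and per-frame angle or displacement would be among the unknowns being optimized; the sketch only shows the forward motion model for fixed values.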