CV · RO · Dec 22, 2025

Zero-shot Reconstruction of In-Scene Object Manipulation from Video

arXiv:2512.19684v1 · 1 citation · h-index: 22
Originality: Highly original
AI Analysis

This work addresses a practical problem for robotics and AR/VR applications: accurate 3D reconstruction of object manipulation in real-world scenes.

The paper tackles the reconstruction of in-scene object manipulation from monocular RGB video and presents what the authors describe as the first system to recover complete hand-object motion that is both metrically accurate and physically plausible.

We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. The problem is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand-centric coordinates and ignore the scene, which hinders metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components: the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion, from grasping to interaction, that remains consistent with the scene information observed in the input video.

