RO · Mar 18

VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning

arXiv:2603.17720 · 89.6 · h-index: 11
Predicted impact: top 10% in RO (last 90 days) · Originality: highly original
AI Analysis

This work addresses robustness and spatial-reasoning limitations in robotic manipulation. It delivers substantial performance gains in simulation and stronger generalization in real-world experiments, though it is an incremental improvement on existing imitation learning methods.

The paper tackles the 2D-3D mismatch in visual imitation learning for robotic manipulation, where 2D image observations are mapped directly to 3D actions. It introduces VolumeDP, a policy architecture that reasons over a volumetric representation, and reports a state-of-the-art average success rate of 88.8% on the LIBERO benchmark, 14.8% above the strongest baseline.

Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a Volumetric Representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, thereby avoiding lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by a substantial 14.8%. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code will be released.
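The three stages described in the abstract (lift image features into voxels, keep only task-relevant voxels, decode actions from the whole token set) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation, which has not been released. Single-head dot-product cross-attention without learned projections stands in for the lifting step. A linear relevance score with hard top-k stands in for the learnable voxel selector. A cross-attention readout followed by a linear head stands in for the multi-token decoder. All function names, shapes, and the random weights are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention, no learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def lift_to_volume(voxel_queries, image_features):
    """Step 1 (assumed form): each voxel query attends over all 2D image tokens."""
    return cross_attention(voxel_queries, image_features, image_features)


def select_voxels(voxel_feats, score_weights, k):
    """Step 2 (hypothetical selector): linear relevance score, keep the top-k voxels.

    Hard top-k is not differentiable; how the real selector is trained is not
    specified in the abstract.
    """
    scores = [dot(v, score_weights) for v in voxel_feats]
    top = sorted(range(len(voxel_feats)), key=lambda i: scores[i], reverse=True)[:k]
    return [voxel_feats[i] for i in sorted(top)]  # keep original voxel order


def decode_actions(action_queries, spatial_tokens, out_weights):
    """Step 3 (assumed form): every action query attends over ALL spatial tokens
    instead of first pooling them into a single descriptor; a linear head then
    maps each attended context vector to an action vector."""
    ctx = cross_attention(action_queries, spatial_tokens, spatial_tokens)
    return [[dot(c, row) for row in out_weights] for c in ctx]


if __name__ == "__main__":
    rng = random.Random(0)
    d, n_img, n_vox, k, horizon, action_dim = 8, 16, 64, 6, 4, 7

    def rand(n):
        return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

    image_features = rand(n_img)   # flattened 2D feature tokens (assumed shape)
    voxel_queries = rand(n_vox)    # one query per voxel of a 4x4x4 grid (assumed)
    volume = lift_to_volume(voxel_queries, image_features)
    tokens = select_voxels(volume, rand(1)[0], k)
    actions = decode_actions(rand(horizon), tokens, rand(action_dim))
    print(len(volume), len(tokens), len(actions), len(actions[0]))  # 64 6 4 7
```

The design point mirrored in `decode_actions` is that each action query sees all `k` selected tokens. Averaging `tokens` into one vector before decoding would be the kind of lossy aggregation the abstract argues against, and keeping `k` much smaller than the voxel count is what keeps the decoder cheap.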

