RO · CV · Mar 8

RoboPCA: Pose-centered Affordance Learning from Human Demonstrations for Robot Manipulation

arXiv:2603.07691v1
Predicted impact: top 17% in RO · last 90 days
Originality: Highly original
AI Analysis

This work addresses the problem of inconsistent contact region and pose predictions for robot manipulation, which can lead to task failures. It is significant for roboticists developing more robust and generalizable manipulation systems.

This paper introduces RoboPCA, a framework that jointly predicts contact regions and poses for robot manipulation, addressing the common issue of inconsistencies when these are predicted separately. It also presents Human2Afford, a data pipeline that automatically extracts 3D scene information and pose-centered affordance annotations from human demonstrations, enabling scalable data collection. RoboPCA, using RGB-D and mask-enhanced features, outperforms baseline methods in various settings and shows strong generalization.

Understanding spatial affordances -- comprising the contact regions of object interaction and the corresponding contact poses -- is essential for robots to effectively manipulate objects and accomplish diverse tasks. However, existing spatial affordance prediction methods mainly focus on locating the contact regions while delegating the pose to independent pose estimation approaches, which can lead to task failures due to inconsistencies between predicted contact regions and candidate poses. In this work, we propose RoboPCA, a pose-centered affordance prediction framework that jointly predicts task-appropriate contact regions and poses conditioned on instructions. To enable scalable data collection for pose-centered affordance learning, we devise Human2Afford, a data curation pipeline that automatically recovers scene-level 3D information and infers pose-centered affordance annotations from human demonstrations. With Human2Afford, scene depth and the interaction object's mask are extracted to provide 3D context and object localization, while pose-centered affordance annotations are obtained by tracking object points within the contact region and analyzing hand-object interaction patterns to establish a mapping from the 3D hand mesh to the robot end-effector orientation. By integrating geometry-appearance cues through an RGB-D encoder and incorporating mask-enhanced features to emphasize task-relevant object regions into the diffusion-based framework, RoboPCA outperforms baseline methods on image datasets, simulation, and real robots, and exhibits strong generalization across tasks and categories.
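The abstract describes lifting a contact region into 3D using scene depth and mapping the 3D hand mesh to a robot end-effector orientation, but gives no formulas. The following is only a minimal sketch of what such a step could look like, not the authors' method. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a parallel-jaw gripper, and three hand keypoints (wrist, thumb tip, index tip) taken from the reconstructed hand. The function names and the choice of axes are hypothetical.

```python
import numpy as np


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth
    to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m], dtype=float)


def gripper_rotation_from_hand(wrist, thumb_tip, index_tip):
    """Hypothetical hand-to-gripper mapping (an assumption, not from the paper).

    Returns a 3x3 rotation matrix whose columns are:
      x: closing axis, pointing from thumb tip to index tip
      y: z cross x, completing a right-handed frame
      z: approach axis, from the wrist toward the fingertip midpoint,
         orthogonalized against x
    """
    wrist, thumb_tip, index_tip = (
        np.asarray(p, dtype=float) for p in (wrist, thumb_tip, index_tip)
    )
    closing = index_tip - thumb_tip
    norm = np.linalg.norm(closing)
    if norm < 1e-9:
        raise ValueError("thumb and index tips coincide")
    closing = closing / norm

    approach = 0.5 * (thumb_tip + index_tip) - wrist
    # Gram-Schmidt step: remove the component along the closing axis.
    approach = approach - approach.dot(closing) * closing
    norm = np.linalg.norm(approach)
    if norm < 1e-9:
        raise ValueError("approach direction is parallel to closing axis")
    approach = approach / norm

    y_axis = np.cross(approach, closing)
    return np.column_stack([closing, y_axis, approach])


def contact_pose(contact_point, rotation):
    """Pack a 3D contact point and a rotation into a 4x4 homogeneous pose."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = contact_point
    return pose
```

In this sketch, the contact position and the orientation are combined into a single pose only at the end. RoboPCA's stated contribution is to predict the two jointly so they stay consistent, which a geometric recipe like this does not attempt to reproduce.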
