Frame Mining: a Free Lunch for Learning Robotic Manipulation from 3D Point Clouds
This work targets the training efficiency of 3D manipulation learning, a concern for robotics researchers. The contribution is incremental in that it builds on existing methods and leaves camera setups unchanged.
The paper studies how the choice of input coordinate frame affects learning robotic manipulation from 3D point clouds. It finds that the end-effector and target-part frames yield higher training efficiency than the commonly used world and robot-base frames in many tasks. It then proposes FrameMiners, which adaptively select and fuse candidate frames, achieving on-par or significantly higher performance than the best single-frame version on five physical manipulation tasks.
We study how choices of input point cloud coordinate frames impact learning of manipulation skills from 3D point clouds. There exist a variety of coordinate frame choices to normalize captured robot-object-interaction point clouds. We find that different frames have a profound effect on agent learning performance, and the trend is similar across 3D backbone networks. In particular, the end-effector frame and the target-part frame achieve higher training efficiency than the commonly used world frame and robot-base frame in many tasks, intuitively because they provide helpful alignments among point clouds across time steps and thus can simplify visual module learning. Moreover, the well-performing frames vary across tasks, and some tasks may benefit from multiple frame candidates. We thus propose FrameMiners to adaptively select candidate frames and fuse their merits in a task-agnostic manner. Experimentally, FrameMiners achieves on-par or significantly higher performance than the best single-frame version on five fully physical manipulation tasks adapted from ManiSkill and OCRTOC. Without changing existing camera placements or adding extra cameras, point cloud frame mining can serve as a free lunch to improve 3D manipulation learning.
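To make concrete what re-expressing a point cloud in a different coordinate frame involves, the following is a minimal sketch in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the helper names `pose_to_matrix` and `world_to_frame`, the example pose, and the example points are all hypothetical. It assumes the frame's pose (for instance, the end-effector pose) is known as a rotation and translation in the world frame.

```python
import numpy as np


def pose_to_matrix(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector.

    Hypothetical helper: the result maps local-frame coordinates to world coordinates.
    """
    T = np.eye(4)
    T[:3, :3] = np.asarray(rotation, dtype=float)
    T[:3, 3] = np.asarray(translation, dtype=float)
    return T


def world_to_frame(points_world, T_world_frame):
    """Express world-frame points of shape (N, 3) in a local frame.

    T_world_frame maps local coordinates to world coordinates, so the rigid
    transform is inverted: p_local = R^T (p_world - t).
    """
    R = T_world_frame[:3, :3]
    t = T_world_frame[:3, 3]
    # For row vectors, (p - t) @ R equals (R^T (p - t))^T.
    return (np.asarray(points_world, dtype=float) - t) @ R


# Assumed example: an end-effector at (0.5, 0.0, 0.2), rotated 90 degrees about z.
R_ee = np.array([[0.0, -1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])
T_world_ee = pose_to_matrix(R_ee, [0.5, 0.0, 0.2])

points_world = np.array([[0.5, 0.0, 0.2],   # exactly at the end-effector origin
                         [0.5, 0.1, 0.2]])  # 0.1 along the world y-axis from it
points_ee = world_to_frame(points_world, T_world_ee)
```

In the end-effector frame, the first point lands at the origin regardless of where the gripper sits in the world, which is the kind of alignment across time steps that the abstract credits for simpler visual module learning. The same function would apply to a target-part frame, given that part's pose.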