Towards Self-Supervised Category-Level Object Pose and Size Estimation
This work addresses the problem of reducing annotation costs for 3D object pose estimation. The contribution is incremental: it builds on prior supervised methods by introducing self-supervision.
The paper tackles category-level object pose and size estimation from a single depth image without ground-truth labels. It proposes a self-supervised method that enforces geometric consistency between template meshes and observed point clouds; the method outperforms traditional baselines by large margins and is competitive with some fully-supervised approaches.
In this work, we tackle the challenging problem of category-level object pose and size estimation from a single depth image. Although previous fully-supervised works have demonstrated promising performance, collecting ground-truth pose labels is generally time-consuming and labor-intensive. Instead, we propose a label-free method that learns to enforce geometric consistency between a category template mesh and the observed object point cloud in a self-supervised manner. Specifically, our method consists of three key components: differentiable shape deformation, registration, and rendering. Shape deformation and registration are applied to the template mesh to eliminate its differences in shape, pose, and scale from the observed object. A differentiable renderer is then deployed to enforce geometric consistency between the point cloud lifted from the rendered depth and the observed scene, which provides the self-supervision signal. We evaluate our approach on real-world datasets and find that it outperforms the simple traditional baseline by large margins while being competitive with some fully-supervised approaches.
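To make the idea of geometric consistency concrete, the following is a minimal sketch and not the implementation described above. It assumes a pinhole camera for lifting depth to 3D points, restricts rotation to a single yaw angle for brevity, and uses a symmetric Chamfer distance as the consistency measure; the abstract does not name a specific loss, so that choice is an assumption. All function names (`lift_depth`, `similarity_transform`, `chamfer_distance`) are hypothetical, and the code is plain Python rather than a differentiable pipeline, with no shape deformation or rendering step.

```python
import math

def lift_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) to 3D points with a pinhole model.
    Pixels with non-positive depth are treated as invalid and skipped."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def similarity_transform(points, scale, yaw, translation):
    """Scale each point, rotate it about the y-axis by `yaw` radians, then translate.
    Assumption: a full 3D rotation is reduced to yaw only to keep the sketch short."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    out = []
    for x, y, z in points:
        x, y, z = scale * x, scale * y, scale * z
        out.append((c * x + s * z + tx, y + ty, -s * x + c * z + tz))
    return out

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance,
    summed over both directions (a to b and b to a)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

# Toy template and a synthetic "observation" generated with a known pose and scale.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
observed = similarity_transform(template, scale=0.5, yaw=0.0, translation=(0.0, 0.0, 2.0))

# The consistency measure vanishes when the estimated pose and scale match the observation.
aligned = similarity_transform(template, scale=0.5, yaw=0.0, translation=(0.0, 0.0, 2.0))
loss_aligned = chamfer_distance(aligned, observed)
loss_unaligned = chamfer_distance(template, observed)
```

In this toy setup `loss_aligned` is zero and `loss_unaligned` is positive, which is the signal a self-supervised method of this kind could minimize with respect to pose, scale, and shape. In practice the observed points would come from something like `lift_depth` applied to the sensor depth image, and every step would need to be differentiable.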