DISP6D: Disentangled Implicit Shape and Pose Learning for Scalable 6D Pose Estimation
This work addresses the problem of handling multiple and novel objects in 6D pose estimation for robotics and AR/VR applications, with incremental improvements in scalability.
The paper tackles scalable 6D pose estimation from RGB images by disentangling shape and pose in an auto-encoder framework. It achieves state-of-the-art performance on benchmarks of textureless CAD objects and of daily objects, and demonstrates improved scalability across categories.
Scalable 6D pose estimation for rigid objects from RGB images aims at handling multiple objects and generalizing to novel objects. Building on a well-known auto-encoding framework to cope with object symmetry and the lack of labeled training data, we achieve scalability by disentangling the latent representation of the auto-encoder into shape and pose sub-spaces. The latent shape space models the similarity of different objects through contrastive metric learning, and the latent pose code is compared with the codes of canonical rotations for rotation retrieval. Because different object symmetries induce inconsistent latent pose spaces, we re-entangle the shape representation with canonical rotations to generate shape-dependent pose codebooks for rotation retrieval. We show state-of-the-art performance on two benchmarks, one containing textureless CAD objects without categories and one containing daily objects with categories, and further demonstrate improved scalability by extending to a more challenging setting of daily objects across categories.
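
The retrieval step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the encoder and the re-entanglement network are replaced by a hypothetical fixed random map (`toy_pose_code`), and the function names, code dimensions, and the choice of eight canonical rotations about the z-axis are all assumptions made for illustration. The sketch only shows the general pattern of building a shape-dependent codebook and retrieving the canonical rotation whose code is most similar (by cosine similarity) to a query pose code.

```python
import math
import random


def z_rotation(theta):
    """Rotation about the z-axis as a flattened (row-major) 3x3 matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [c, -s, 0.0, s, c, 0.0, 0.0, 0.0, 1.0]


def toy_pose_code(shape_code, rotation, weights):
    """Hypothetical stand-in for re-entanglement: a fixed random linear map
    of the concatenated [shape_code, rotation], followed by tanh."""
    feats = list(shape_code) + list(rotation)
    return [math.tanh(sum(w * f for w, f in zip(row, feats))) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_codebook(shape_code, rotations, weights):
    """One pose code per canonical rotation, conditioned on the shape code."""
    return [toy_pose_code(shape_code, r, weights) for r in rotations]


def retrieve(query_code, codebook):
    """Index of the codebook entry most similar to the query pose code."""
    sims = [cosine(query_code, code) for code in codebook]
    return max(range(len(sims)), key=sims.__getitem__)


def demo():
    rng = random.Random(0)  # fixed seed so the toy map is reproducible
    shape_code = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # assumed 4-D shape code
    weights = [[rng.gauss(0.0, 1.0) for _ in range(4 + 9)] for _ in range(16)]
    rotations = [z_rotation(2.0 * math.pi * k / 8) for k in range(8)]
    codebook = build_codebook(shape_code, rotations, weights)
    # In a real pipeline the query would come from an encoder applied to an
    # RGB crop; here it is simply the code of canonical rotation 3.
    query = toy_pose_code(shape_code, rotations[3], weights)
    return retrieve(query, codebook)
```

Calling `demo()` builds a codebook for one shape code and looks up a query generated from one of the canonical rotations. Because the codebook depends on the shape code, a different object would yield a different codebook over the same set of canonical rotations, which is the property the abstract attributes to shape-dependent pose codebooks.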