CV Aug 18, 2022

COPE: End-to-end trainable Constant Runtime Object Pose Estimation

Stefan Thalhammer, Timothy Patten, Markus Vincze

arXiv:2208.08807v2 10.116 citations h-index: 44

Originality: Incremental advance

AI Analysis

The paper addresses scalability and speed limitations of object pose estimation for robotics and AR/VR applications, though it is an incremental improvement over existing geometric-correspondence methods.

The paper tackles slow, poorly scaling multi-stage object pose estimation by introducing an end-to-end trainable method that directly regresses 6D poses for all instances in an image. It reports pose estimation performance superior to single-model state-of-the-art approaches while running more than 35 times faster.

State-of-the-art object pose estimation handles multiple instances in a test image with multi-model formulations: detection as a first stage, followed by separately trained per-object networks that predict 2D-3D geometric correspondences as a second stage. Poses are then estimated at runtime using the Perspective-n-Point (PnP) algorithm. Unfortunately, multi-model formulations are slow and do not scale well with the number of object instances involved. Recent approaches show that direct 6D object pose estimation is feasible when derived from the aforementioned geometric correspondences. We present an approach that learns an intermediate geometric representation of multiple objects to directly regress the 6D poses of all instances in a test image. The inherent end-to-end trainability removes the need to process individual object instances separately. Pose hypotheses are clustered into distinct instances by calculating their mutual Intersection-over-Union (IoU), which adds negligible runtime overhead with respect to the number of object instances. Results on multiple challenging standard datasets show that the pose estimation performance is superior to single-model state-of-the-art approaches despite being more than 35 times faster. We additionally provide an analysis showing real-time applicability (>24 fps) for images in which more than 90 object instances are present. Further results show the advantage of supervising geometric-correspondence-based object pose estimation with the 6D pose.
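The abstract does not spell out how hypotheses are grouped, so the following is only a minimal sketch of the general idea of clustering by mutual IoU, not the authors' implementation. It assumes each pose hypothesis carries an axis-aligned 2D box `(x_min, y_min, x_max, y_max)` and that two hypotheses belong to the same instance when their IoU exceeds a threshold. The names `iou` and `cluster_by_iou` and the threshold of 0.5 are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cluster_by_iou(boxes: List[Box], threshold: float = 0.5) -> List[List[int]]:
    """Group hypothesis indices whose boxes overlap above `threshold`.

    Uses union-find so that chains of overlapping hypotheses end up
    in the same cluster (an assumption of this sketch).
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > threshold:
                parent[find(j)] = find(i)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


# Four hypotheses: two overlap around one object, two around another.
hypotheses = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (51, 50, 61, 60)]
clusters = cluster_by_iou(hypotheses)
```

In this example the first two boxes and the last two boxes each form one cluster. In a full pipeline, the pose hypotheses within a cluster would then be combined into a single pose per instance; how that is done is not described in the abstract. This pure-Python pairwise loop is for readability only and says nothing about the runtime behaviour reported in the paper.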
