Generative 6D Pose Estimation via Conditional Flow Matching
This work addresses accurate 6D pose estimation for robotics and computer vision applications. It is an incremental advance that combines generative modeling with existing techniques, namely local feature conditioning and RANSAC-based registration.
The paper tackles 6D pose estimation by formulating it as a conditional flow matching problem, introducing Flose, which integrates appearance-based features to address object symmetries and achieves an average improvement of +4.5 Average Recall on five BOP benchmark datasets.
Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in $\mathrm{SE}(3)$ or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features. To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in $\mathbb{R}^3$. We introduce Flose, a generative method that infers object poses via a denoising process conditioned on local features. While prior approaches based on conditional flow matching perform denoising based solely on geometric guidance, Flose integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries. We further incorporate RANSAC-based registration to handle outliers. We validate Flose on five datasets from the established BOP benchmark. Flose outperforms prior methods with an average improvement of +4.5 Average Recall. Project website: https://tev-fbk.github.io/Flose/
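The two ingredients named above, a denoising process in $\mathbb{R}^3$ and RANSAC-based registration, can be illustrated with a minimal toy sketch. It is not the Flose implementation: the learned, feature-conditioned network is replaced by an oracle velocity field that already knows the target points, the probability path is assumed to be the linear one $x_t = (1-t)\,x_0 + t\,x_1$, and all function names (`euler_flow`, `kabsch`, `ransac_register`), the threshold, and the synthetic data are hypothetical choices made for illustration.

```python
import numpy as np

def euler_flow(x0, velocity, steps=20):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)
    return x

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # reflection-safe rotation
    return R, dst_c - R @ src_c

def ransac_register(src, dst, iters=200, thresh=0.01, seed=0):
    """Fit (R, t) from random 3-point samples; keep the largest inlier set."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        inliers = np.linalg.norm(src @ R.T + t - dst, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("RANSAC found no consistent rigid transform")
    R, t = kabsch(src[best], dst[best])           # refit on all inliers
    return R, t, best

# Synthetic object model and ground-truth pose (assumed values).
rng = np.random.default_rng(0)
model = rng.normal(size=(100, 3))
c, s = np.cos(0.7), np.sin(0.7)
R_gt = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_gt = np.array([0.1, -0.2, 0.5])
target = model @ R_gt.T + t_gt

def oracle_velocity(x, t):
    # Stand-in for a learned network: for the linear path the conditional
    # velocity x1 - x0 equals (x1 - x_t) / (1 - t).
    return (target - x) / (1.0 - t)

# Denoise Gaussian noise into scene-space points, then corrupt 20 of them
# to mimic outliers before registering the model to the result.
denoised = euler_flow(rng.normal(size=model.shape), oracle_velocity)
denoised[:20] += rng.normal(scale=0.5, size=(20, 3))
R_est, t_est, inliers = ransac_register(model, denoised)
```

In this toy setting the registration is fit only to points that agree with a single rigid transform, so the corrupted points do not bias the recovered pose. A real system would replace `oracle_velocity` with a trained model conditioned on local features.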