Towards Two-view 6D Object Pose Estimation: A Comparative Study on Fusion Strategy
This work addresses the challenge of robust pose estimation in environments with textureless or similar-looking surfaces, which is crucial for robotics and AR applications, though the contribution is incremental because it builds on existing fusion strategies.
The paper tackles the problem of 6D object pose estimation from RGB images by proposing a framework that learns implicit 3D information from two RGB views, outperforming state-of-the-art RGB-based methods and achieving results comparable to RGBD-based methods.
Current RGB-based 6D object pose estimation methods have achieved notable performance on benchmark datasets and in real-world applications. However, predicting 6D pose from the 2D features of a single image is sensitive to environmental changes and to textureless or similar-looking object surfaces. Hence, RGB-based methods generally achieve less competitive results than RGBD-based methods, which exploit both image features and 3D structure features. To narrow this performance gap, this paper proposes a framework for 6D object pose estimation that learns implicit 3D information from two RGB images. By combining the learned 3D information with 2D image features, we establish more stable correspondences between the scene and the object models. To find the method that best utilizes 3D information from RGB inputs, we investigate three different approaches: Early-Fusion, Mid-Fusion, and Late-Fusion. We find that Mid-Fusion is the best approach for recovering the most precise 3D keypoints for object pose estimation. Experiments show that our method outperforms state-of-the-art RGB-based methods and achieves results comparable to RGBD-based methods.
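To make the distinction between the three fusion strategies concrete, the following is a minimal sketch under stated assumptions, not the architecture used in the paper. It uses the generic meaning of the terms: Early-Fusion combines the two views at the input, Mid-Fusion combines per-view features, and Late-Fusion combines per-view predictions. The names `encoder`, `head`, the linear-plus-ReLU layers, the vector sizes, and the averaging step in Late-Fusion are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, K = 16, 8, 4  # assumed sizes: input dim, feature dim, number of 3D keypoints


def encoder(x, w):
    """Toy stand-in for a feature extractor: linear map followed by ReLU."""
    return np.maximum(x @ w, 0.0)


def head(f, w):
    """Toy stand-in for a keypoint regressor: linear map to K*3 coordinates."""
    return f @ w


def early_fusion(view_a, view_b, w_enc, w_head):
    # Concatenate the raw inputs, then run a single encoder and head.
    x = np.concatenate([view_a, view_b], axis=-1)
    return head(encoder(x, w_enc), w_head)


def mid_fusion(view_a, view_b, w_enc, w_head):
    # Encode each view with a shared encoder, concatenate the features,
    # then regress keypoints from the fused feature vector.
    f = np.concatenate([encoder(view_a, w_enc), encoder(view_b, w_enc)], axis=-1)
    return head(f, w_head)


def late_fusion(view_a, view_b, w_enc, w_head):
    # Predict keypoints independently per view, then average the predictions.
    # Averaging is a simplification: it assumes both predictions are already
    # expressed in a common coordinate frame.
    pred_a = head(encoder(view_a, w_enc), w_head)
    pred_b = head(encoder(view_b, w_enc), w_head)
    return 0.5 * (pred_a + pred_b)


# Two flattened "views" represented as random vectors (placeholder data).
view_a = rng.normal(size=D)
view_b = rng.normal(size=D)

early = early_fusion(view_a, view_b,
                     rng.normal(size=(2 * D, H)), rng.normal(size=(H, 3 * K)))
mid = mid_fusion(view_a, view_b,
                 rng.normal(size=(D, H)), rng.normal(size=(2 * H, 3 * K)))
late = late_fusion(view_a, view_b,
                   rng.normal(size=(D, H)), rng.normal(size=(H, 3 * K)))

# Each strategy produces K keypoints with 3 coordinates each.
keypoints = {name: out.reshape(K, 3)
             for name, out in [("early", early), ("mid", mid), ("late", late)]}
```

The only structural difference between the three functions is where the `np.concatenate` (or averaging) step sits relative to the encoder and the head, which is the design choice the comparative study examines.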