MK-Pose: Category-Level Object Pose Estimation via Multimodal-Based Keypoint Learning
This work addresses object pose estimation for applications like warehouse automation and manufacturing, offering an incremental improvement over existing methods.
The paper tackles category-level object pose estimation by proposing MK-Pose, a multimodal framework that integrates RGB images, point clouds, and textual descriptions, and that outperforms state-of-the-art methods on the CAMERA25 and REAL275 datasets in IoU and average precision without shape priors.
Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods that rely on RGB images or point cloud data often struggle with object occlusion and with generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching, and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further tested for cross-dataset generalization on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Code will be released at \href{https://github.com/yangyifanYYF/MK-Pose}{https://github.com/yangyifanYYF/MK-Pose}.
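To make the idea of query-driven soft heatmaps concrete, the following is a minimal sketch of one common way such a step can be realized; it is not taken from the MK-Pose code. It assumes each keypoint query is scored against per-point features by a dot product, the scores are turned into a heatmap with a softmax over points, and the keypoint is the heatmap-weighted average of the point coordinates. The function names `softmax` and `soft_keypoints` and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_keypoints(points, features, queries, temperature=1.0):
    """Illustrative soft keypoint extraction (an assumption, not the paper's method).

    points:   list of (x, y, z) tuples, one per point in the cloud
    features: list of feature vectors, one per point
    queries:  list of query vectors, one per keypoint
    Returns one (x, y, z) keypoint per query.
    """
    keypoints = []
    for q in queries:
        # Dot-product similarity between the query and every point feature.
        scores = [
            sum(qi * fi for qi, fi in zip(q, f)) / temperature
            for f in features
        ]
        # Soft heatmap: a probability distribution over the points.
        heatmap = softmax(scores)
        # Keypoint: expectation of the point coordinates under the heatmap.
        kp = tuple(
            sum(w * p[d] for w, p in zip(heatmap, points)) for d in range(3)
        )
        keypoints.append(kp)
    return keypoints


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    print(soft_keypoints(pts, feats, queries=[[0.0, 0.0], [50.0, 0.0]]))
```

Because the keypoint is a weighted average rather than a hard argmax, the operation stays differentiable with respect to both the queries and the point features, which is the usual motivation for soft heatmaps in self-supervised keypoint learning.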