CV LGNov 11, 2024

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

Junho Kim, Hyungjin Chung, Byung-Hoon Kim

arXiv:2411.06869v25.23 citationsh-index: 3Has Code

Originality Highly original

AI Analysis

This addresses the cumbersome reliance on support images in pose estimation for diverse object categories, representing a significant advancement rather than an incremental improvement.

The paper tackles the problem of category-agnostic pose estimation by introducing CapeLLM, a multimodal large language model that uses only query images and text descriptions without support images, achieving state-of-the-art results on the MP-100 benchmark in 1-shot and 5-shot settings.

Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have explored the use of text queries, leveraging their enhanced stability and generalization capabilities. However, existing approaches often remain constrained by their reliance on support queries, their failure to fully utilize the rich priors embedded in pre-trained large language models, and the limitations imposed by their parametric distribution assumptions. To address these challenges, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. Our method only employs query image and detailed text descriptions as an input to estimate category-agnostic keypoints. Our method encompasses effective training strategies and carefully designed instructions for applying the MLLM to CAPE. Moreover, we propose an inference mechanism that further enhances the reasoning process for unseen keypoints. while flexibly modeling their underlying spatial distribution and uncertainty, allowing for adaptive refinement based on contextual cues. We conducted extensive experiments to apply the MLLM to CAPE effectively, focusing not only on the model architecture and prompt design but also on ensuring robustness across input variations. Our approach sets a new state-of-the-art on the MP-100 benchmark in the 1-shot and even 5-shot setting, marking a significant advancement in the field of category-agnostic pose estimation. Code is available at https://github.com/Junhojuno/CapeLLM.

View on arXiv PDF Code

Similar