PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment
This work addresses high-precision keypoint localization in language-guided human pose estimation and offers an incremental improvement over existing methods.
The paper tackles limited generalization in human pose estimation by proposing PoseLLM, a framework that replaces a linear projector with a nonlinear MLP connector to enhance vision-language fusion. PoseLLM achieves 77.8 AP on COCO, outperforming LocLLM by 0.4 AP while maintaining zero-shot generalization.
Human pose estimation traditionally relies on architectures that encode keypoint priors, which limits generalization to novel poses or unseen keypoints. Recent language-guided approaches such as LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture the complex spatial-textual interactions that are critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework to replace the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by 0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple nonlinear connector improves localization accuracy without sacrificing generalization, advancing the state of the art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
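To make the contrast between the two connector types concrete, the following is a minimal sketch in plain Python, not the released implementation. It assumes the connector acts independently on each visual patch embedding and that the MLP's hidden width equals the LLM embedding width; the abstract does not state either detail. The class names (`LinearProjector`, `MLPConnector`), the helper functions, and the toy dimensions are all hypothetical and chosen only for illustration.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Affine map; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def init_layer(in_dim, out_dim, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class LinearProjector:
    """Baseline connector: a single affine map per patch embedding."""

    def __init__(self, vis_dim, llm_dim, rng):
        self.w, self.b = init_layer(vis_dim, llm_dim, rng)

    def __call__(self, patches):
        return [linear(p, self.w, self.b) for p in patches]


class MLPConnector:
    """Two-layer MLP connector: Linear -> GELU -> Linear per patch.

    Assumption: the hidden width equals llm_dim; the abstract does not
    specify it.
    """

    def __init__(self, vis_dim, llm_dim, rng):
        self.w1, self.b1 = init_layer(vis_dim, llm_dim, rng)
        self.w2, self.b2 = init_layer(llm_dim, llm_dim, rng)

    def __call__(self, patches):
        out = []
        for p in patches:
            hidden = [gelu(h) for h in linear(p, self.w1, self.b1)]
            out.append(linear(hidden, self.w2, self.b2))
        return out


# Toy sizes for illustration only; real vision and LLM widths are far larger.
rng = random.Random(0)
VIS_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4
patches = [[rng.gauss(0.0, 1.0) for _ in range(VIS_DIM)]
           for _ in range(NUM_PATCHES)]

baseline_tokens = LinearProjector(VIS_DIM, LLM_DIM, rng)(patches)
mlp_tokens = MLPConnector(VIS_DIM, LLM_DIM, rng)(patches)
```

Both connectors map each patch embedding to one token of LLM width, so under these assumptions the MLP is a drop-in replacement for the linear projector. The only structural difference is the GELU nonlinearity between the two affine maps, which is what allows the connector to represent non-affine transformations of the visual features.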