CV · Jul 12, 2025

PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment

arXiv:2507.09139v1 · 3 citations · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of high-precision localization in language-guided pose estimation for computer vision applications, representing an incremental improvement over existing methods.

The paper tackles limited generalization in human pose estimation by proposing PoseLLM, a framework that replaces LocLLM's linear projector with a nonlinear MLP connector to improve vision-language fusion. It achieves 77.8 AP on the COCO validation set, +0.4 AP over LocLLM, while maintaining zero-shot generalization.

Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
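The connector change described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the PoseLLM repository. The class names, the feature sizes (32 and 64), the weight initialization, the hidden width (set equal to the output width) and the tanh approximation of GELU are all assumptions made for illustration.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (an assumption; the exact erf form also works).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class LinearProjector:
    """Single affine map from visual features to the LLM embedding space."""

    def __init__(self, vis_dim, llm_dim, rng):
        self.w = rng.normal(0.0, 0.02, (vis_dim, llm_dim))
        self.b = np.zeros(llm_dim)

    def __call__(self, patches):
        return patches @ self.w + self.b


class MLPConnector:
    """Two affine layers with a GELU in between (hidden width is hypothetical)."""

    def __init__(self, vis_dim, llm_dim, rng):
        self.w1 = rng.normal(0.0, 0.02, (vis_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, patches):
        hidden = gelu(patches @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


rng = np.random.default_rng(0)
# Hypothetical shapes: (batch, num_patches, vis_dim)
patches = rng.normal(size=(2, 16, 32))

linear = LinearProjector(vis_dim=32, llm_dim=64, rng=rng)
mlp = MLPConnector(vis_dim=32, llm_dim=64, rng=rng)

visual_tokens = mlp(patches)
print(visual_tokens.shape)
```

Both connectors map each visual patch feature to a vector of the LLM's embedding width, so one can be swapped for the other without changing the rest of the pipeline. Only the MLP version applies a nonlinearity between the two projections.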

Code Implementations: 1 repo