CVOct 14, 2025

On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation

arXiv:2510.12660v2h-index: 5Has Code2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality Incremental advance
AI Analysis

This work addresses computational efficiency for researchers and practitioners in computer vision, but it is incremental as it builds on existing hierarchical models.

The paper tackled the problem of high computational cost in human mesh recovery and pose estimation by proposing truncated hierarchical vision foundation models as encoders, achieving performance comparable to full models while improving efficiency, with evaluations showing better accuracy-efficiency trade-offs across 27 models.

In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives. The source code is available at https://github.com/nttcom/TruncHierVFM.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes