CV · Oct 13, 2025

FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

arXiv:2510.10868v1 · 3 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses efficiency bottlenecks for researchers and practitioners who use 3D human pose estimation models, though the advance is incremental because it builds on existing transformer architectures.

The paper tackles the high computational cost of transformer-based 3D Human Mesh Recovery models by introducing two merging strategies (Error-Constrained Layer Merging and Mask-guided Token Merging) and a diffusion-based decoder, achieving up to 2.3x speed-up while slightly improving performance over the baseline.

Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.
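To make two of the terms in the abstract concrete, here is a minimal sketch in plain Python. `mpjpe` computes the standard Mean Per Joint Position Error, the average Euclidean distance between predicted and ground-truth joints. `merge_background_tokens` is a hypothetical, heavily simplified illustration of mask-guided merging: it keeps foreground tokens untouched and averages all background tokens into a single token. The function names, the boolean mask format, and the averaging rule are assumptions made for illustration; the abstract does not specify how Mask-ToMe selects or combines tokens.

```python
import math


def mpjpe(pred, gt):
    """Mean Per Joint Position Error.

    pred, gt: sequences of 3D joint coordinates in the same order and units.
    Returns the mean Euclidean distance over joints.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def merge_background_tokens(tokens, is_foreground):
    """Illustrative only (not the paper's algorithm): keep foreground tokens
    and replace all background tokens with their element-wise mean.

    tokens: list of equal-length feature vectors (lists of floats).
    is_foreground: list of booleans, one per token (assumed mask format).
    """
    if len(tokens) != len(is_foreground):
        raise ValueError("tokens and is_foreground must have the same length")
    kept = [t for t, fg in zip(tokens, is_foreground) if fg]
    background = [t for t, fg in zip(tokens, is_foreground) if not fg]
    if background:
        dim = len(background[0])
        merged = [sum(t[d] for t in background) / len(background)
                  for d in range(dim)]
        kept.append(merged)  # one token now stands in for the background
    return kept


# Two joints with errors of 3 and 4 units give a mean error of 3.5.
error = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])

# Three tokens, the last two marked as background, reduce to two tokens.
reduced = merge_background_tokens(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [True, False, False],
)
```

In this toy setup the sequence shrinks from three tokens to two, which shows the general idea of cutting attention cost by reducing the number of background tokens. An error-constrained scheme in the spirit of ECLM would then use a metric such as `mpjpe` on held-out data to decide which reductions are acceptable.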

