CV · Jun 12, 2025

Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders

arXiv:2506.10816v1 · 3 citations · h-index: 10 · Has Code
Originality: Incremental advance
AI Analysis

This work targets accurate 3D hand-object pose estimation for robotics and AR/VR applications. Its contribution is incremental, since it builds on existing masked autoencoder techniques.

The paper tackles hand-object pose estimation from monocular RGB images under severe occlusion. It proposes HOMAE, a masked-autoencoder method with a target-focused masking strategy and multi-scale feature integration, and reports state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks.

Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.
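The target-focused masking idea in the abstract can be illustrated with a small sketch. The abstract does not say how masked patches are chosen, so everything below is an assumption made for illustration: the function name `target_focused_mask`, the use of a hand-object bounding box, the `focus_weight` parameter, and the weighted sampling scheme are all hypothetical and are not taken from HOMAE's released code. The sketch simply makes patches that overlap the interaction region more likely to be masked than background patches.

```python
import numpy as np

def target_focused_mask(bbox, img_size=224, patch=16, mask_ratio=0.75,
                        focus_weight=4.0, seed=0):
    """Sample a patch mask biased toward a hand-object region.

    bbox: (x0, y0, x1, y1) pixel box around the interaction (assumed given).
    focus_weight: hypothetical knob; relative sampling weight of in-box patches.
    Returns (mask, in_target), both boolean arrays of shape (grid, grid).
    True in `mask` marks a patch hidden from the encoder.
    """
    grid = img_size // patch
    starts = np.arange(grid) * patch      # left/top pixel of each patch
    ends = starts + patch                 # right/bottom pixel (exclusive)
    x0, y0, x1, y1 = bbox

    # A patch belongs to the target if it overlaps the box on both axes.
    col_hit = (starts < x1) & (ends > x0)
    row_hit = (starts < y1) & (ends > y0)
    in_target = np.outer(row_hit, col_hit)

    # Weighted sampling without replacement: in-box patches get more weight.
    weights = np.where(in_target, focus_weight, 1.0).ravel()
    probs = weights / weights.sum()
    n_mask = int(round(mask_ratio * grid * grid))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(grid * grid, size=n_mask, replace=False, p=probs)

    mask = np.zeros(grid * grid, dtype=bool)
    mask[chosen] = True
    return mask.reshape(grid, grid), in_target

mask, in_target = target_focused_mask((64, 64, 160, 160))
print("masked patches:", int(mask.sum()), "of", mask.size)
print("masked inside target:", int((mask & in_target).sum()),
      "of", int(in_target.sum()))
```

Setting `focus_weight` to 1.0 reduces this to the uniform random masking of a standard masked autoencoder, which makes the bias toward the interaction region easy to ablate in such a sketch.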

