CV · RO · May 20

AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

arXiv:2605.21714 · 41.5
Predicted impact top 90% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For egocentric 3D hand tracking, AVI-HT mitigates visual occlusion by adaptively fusing on-glove IMU signals with the camera image, which improves tracking accuracy over the baselines.

AVI-HT adaptively fuses egocentric vision and on-glove IMU signals for 3D hand tracking, reducing mean keypoint error by 16.1% and wrist-aligned error by 24.2% over baselines, especially under heavy visual occlusion in hand-object interaction scenarios.
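To make the two reported metrics concrete, here is a minimal sketch of how a mean keypoint error and a wrist-aligned variant are commonly computed. It assumes that "wrist-aligned" means translating both hands so the wrist keypoint sits at the origin before measuring distances; the paper may also align rotation. The function names and the `wrist_index` parameter are hypothetical, not taken from the paper.

```python
import math


def mean_keypoint_error(pred, gt):
    """Mean Euclidean distance between matching 3D keypoints."""
    if not pred or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def wrist_aligned_error(pred, gt, wrist_index=0):
    """Mean keypoint error after removing the global wrist offset.

    Assumption: alignment is translation-only, using the keypoint at
    wrist_index as the origin of each hand.
    """
    def center(points):
        wrist = points[wrist_index]
        return [tuple(c - w for c, w in zip(p, wrist)) for p in points]

    return mean_keypoint_error(center(pred), center(gt))


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error improves on baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Toy two-keypoint hand: the prediction is shifted 1 unit along z.
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(mean_keypoint_error(pred, gt))
    print(wrist_aligned_error(pred, gt))
```

In this toy case the global error is 1.0 while the wrist-aligned error is 0.0, because the prediction is wrong only by a rigid offset of the whole hand. The two numbers can therefore move independently: the wrist-aligned metric isolates finger articulation from global hand placement.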

We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.
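The idea of adaptively modulating trust across sensors can be illustrated with a small scaled dot-product attention step. This is a simplified stand-in for the general technique, not the authors' architecture: the sensor names, the hand-written feature vectors, and the `fuse_sensors` function are all hypothetical, and a real model would learn the query, key, and value projections rather than use raw features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_sensors(query, sensor_features):
    """Fuse per-sensor feature vectors with attention weights.

    query: feature vector describing the current tracking state.
    sensor_features: dict mapping sensor name to a feature vector of the
        same length as query. In this sketch each vector serves as both
        key and value (an assumption made for brevity).
    Returns (weights_by_sensor, fused_vector).
    """
    names = list(sensor_features)
    dim = len(query)
    # Scaled dot-product score of the query against each sensor's features.
    scores = [
        sum(q * f for q, f in zip(query, sensor_features[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of sensor features: low-scoring sensors contribute less.
    fused = [
        sum(w * sensor_features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


if __name__ == "__main__":
    # Toy case: vision features are uninformative (e.g. an occluded hand),
    # while one IMU agrees strongly with the query.
    features = {
        "vision": [0.0, 0.0],
        "imu_index": [2.0, 0.0],
        "imu_thumb": [0.5, 0.0],
    }
    weights, fused = fuse_sensors([1.0, 0.0], features)
    print(weights)
    print(fused)
```

Because the weights come from a softmax, they always sum to one, so reducing trust in an occluded camera view necessarily shifts weight onto the IMU streams. That matches the behaviour the abstract describes at a high level.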

