CVFeb 25

UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling

arXiv:2602.21631v11 citationsh-index: 4
Originality Incremental advance
AI Analysis

This work addresses the problem of realistic hand motion modeling for applications in human-computer interaction and virtual reality, representing an incremental improvement by unifying existing tasks.

The paper tackles the challenge of modeling realistic 4D hand motion by proposing UniHand, a unified diffusion-based framework that integrates estimation and generation tasks, achieving robust performance under severe occlusions and incomplete inputs.

Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes