CV · Mar 30

HandX: Scaling Bimanual Motion and Interaction Generation

arXiv:2603.28766 · 74.4 · h-index: 7
AI Analysis

This work addresses a domain-specific problem in human motion synthesis: generating bimanual hand interactions. It is incremental, building on existing methods with new data and annotations.

The paper tackles the generation of realistic bimanual hand motion and interaction, an underexplored area of human motion synthesis, by introducing HandX, a unified framework with a new dataset, annotation method, and evaluation metrics. Experiments with diffusion and autoregressive models show that larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion.

Human motion synthesis has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior (finger articulation, contact timing, and inter-hand coordination), and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that first extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
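
To make the first stage of the decoupled annotation strategy more concrete, here is a minimal sketch of how contact events and finger flexion might be computed from 3D joint positions. It is an illustration under stated assumptions, not the authors' pipeline: the function names `flexion_angle` and `contact_events`, the 1 cm contact threshold, and the per-frame list-of-points input format are all hypothetical.

```python
import math

def flexion_angle(a, b, c):
    """Angle in degrees at joint b between segments b->a and b->c.

    180 means the three joints are collinear (straight finger);
    smaller values mean more flexion. Hypothetical helper.
    """
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("degenerate joint triple: coincident points")
    cos = sum(x * y for x, y in zip(u, v)) / norm
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def contact_events(left_tips, right_tips, threshold=0.01):
    """Return inclusive (start_frame, end_frame) spans where two fingertips
    are closer than `threshold` (assumed to be in metres).

    `left_tips` and `right_tips` are per-frame 3D positions of one
    fingertip on each hand. The threshold is an assumed value.
    """
    n = min(len(left_tips), len(right_tips))
    events, start = [], None
    for t in range(n):
        touching = math.dist(left_tips[t], right_tips[t]) < threshold
        if touching and start is None:
            start = t                      # contact begins
        elif not touching and start is not None:
            events.append((start, t - 1))  # contact ended on previous frame
            start = None
    if start is not None:
        events.append((start, n - 1))      # contact lasts until the final frame
    return events
```

Features of this kind (contact spans, per-joint flexion angles) could then be serialized into a text prompt for a language model, which is the second stage the abstract describes.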

