CV | May 14, 2025

Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

Julian Tanke, Takashi Shibuya, Kengo Uchida, Koichi Saito, Yuki Mitsufuji

arXiv:2505.09827v1 | 10.25 citations | h-index: 15 | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of long-term dyadic human motion synthesis for applications such as animation and virtual reality. The contribution is incremental, since it builds on existing State-Space Model (SSM) methods.

The paper tackles the problem of generating realistic dyadic human motion from text descriptions for extended interactions, introducing Dyadic Mamba, which uses State-Space Models to achieve competitive performance on short-term benchmarks and significantly outperforms transformer-based approaches on longer sequences.

Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.
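The abstract's two architectural ideas, a recurrent state-space model in place of positional encodings and concatenation in place of cross-attention, can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses a plain linear recurrence with fixed matrices (Mamba makes these parameters input-dependent), it assumes the two motion streams are concatenated frame by frame along the feature axis (the abstract does not say along which axis the concatenation happens), and the names `ssm_scan` and `concat_dyad` are hypothetical.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete linear state-space recurrence over a sequence.

    h_t = A h_{t-1} + B x_t,   y_t = C h_t

    x: (T, d_in) inputs, A: (d_state, d_state),
    B: (d_state, d_in), C: (d_out, d_state). Returns y: (T, d_out).
    The recurrence has no positional encoding, so T can be any length.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # update the hidden state from the current frame
        ys.append(C @ h)      # read out the output for this frame
    return np.stack(ys)

def concat_dyad(motion_a, motion_b):
    """Assumed fusion: frame-wise feature concatenation, (T, d) + (T, d) -> (T, 2d)."""
    return np.concatenate([motion_a, motion_b], axis=-1)

rng = np.random.default_rng(0)
d, d_state = 4, 8                       # toy sizes, chosen for illustration only
A = 0.9 * np.eye(d_state)               # stable, fixed transition (not selective)
B = 0.1 * rng.normal(size=(d_state, 2 * d))
C = 0.1 * rng.normal(size=(2 * d, d_state))

# The same parameters process a short and a much longer sequence.
for T in (16, 1000):
    person_a = rng.normal(size=(T, d))
    person_b = rng.normal(size=(T, d))
    y = ssm_scan(concat_dyad(person_a, person_b), A, B, C)
    print(T, y.shape)
```

Because the hidden state is updated from both persons' features at every frame, each output frame can depend on the other person's past motion without an explicit cross-attention layer, and the loop itself places no limit on sequence length.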

