CV · Sep 28, 2025

MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing

Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, Shiguang Shan

arXiv:2509.23635v1 · 13.15 citations · h-index: 12

Originality: Highly original

AI Analysis

This work addresses the challenge of integrating motion and language modalities for AI applications such as animation and robotics, offering a novel method for a known bottleneck.

The paper tackles the problem of human motion comprehension, generation, and editing by proposing MotionVerse, a unified framework that uses Large Language Models with motion tokenization and a delay parallel modeling strategy, achieving superior performance across various motion tasks.

This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a dual-tower architecture with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.
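To illustrate what "temporally staggering" residual token streams can look like, here is a minimal sketch of a delay pattern of the kind used in other multi-codebook token models. The abstract does not give the exact schedule, so the one-step offset per stream, the `PAD` value, and the function names `apply_delay` and `undo_delay` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

PAD = -1  # hypothetical padding id; the paper's actual special tokens are not stated


def apply_delay(streams: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift residual stream k right by k steps (assumed one-step offset).

    After the shift, column t holds stream k's token for frame t - k, so a
    model emitting one column per step sees the coarser streams of a frame
    before it predicts the finer ones.
    """
    num_streams = len(streams)
    length = len(streams[0])
    total = length + num_streams - 1  # sequence grows by only K - 1 steps
    delayed = []
    for k, stream in enumerate(streams):
        row = [pad] * k + list(stream) + [pad] * (total - length - k)
        delayed.append(row)
    return delayed


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay, recovering frame-aligned token streams."""
    num_streams = len(delayed)
    length = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + length] for k, row in enumerate(delayed)]


# Three residual streams over three motion frames (toy token ids).
streams = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = apply_delay(streams)
restored = undo_delay(delayed)
```

Under these assumptions, K streams of length T take T + K - 1 decoding steps rather than K × T for a fully flattened sequence, which is one way to read the abstract's claim of efficiency comparable to single-stream modeling.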

