GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
This work improves motion generation for applications like animation and robotics by integrating diverse datasets, though it is incremental in combining existing techniques like VQ-VAE and transformers.
The paper addresses text-conditional human motion generation, focusing on the data heterogeneity that arises when training on multi-source datasets. It proposes GenM\(^3\), which achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark and demonstrates strong zero-shot generalization on the IDEA400 dataset.
Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose the Generative Pretrained Multi-path Motion Model (GenM\(^3\)), a comprehensive framework designed to learn unified motion representations. GenM\(^3\) comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment through the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment them with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM\(^3\) achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing previous methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
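The abstract mentions "densely activated experts" without giving details. As a rough illustration of the general idea only, the following minimal sketch runs every expert on a token and blends the outputs with softmax gate weights, in contrast to sparse routing that selects a few experts. It is not the authors' implementation: the class `DenseExpertLayer`, the linear form of each expert, the softmax gate, and all sizes are assumptions made for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class DenseExpertLayer:
    """Hypothetical densely activated expert layer (illustrative only).

    Each expert is assumed to be a single dim x dim linear map, and the
    gate is assumed to be a linear map followed by a softmax.
    """

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        scale = dim ** -0.5
        # One dim x dim weight matrix per expert.
        self.experts = [
            [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]
        # Gate matrix with one row of scores per expert.
        self.gate = [
            [rng.gauss(0.0, scale) for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def gate_weights(self, token):
        """Return one positive weight per expert, summing to 1."""
        return softmax(matvec(self.gate, token))

    def forward(self, token):
        """Evaluate all experts (dense activation) and mix their outputs."""
        weights = self.gate_weights(token)
        outputs = [matvec(expert, token) for expert in self.experts]
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(token))
        ]


layer = DenseExpertLayer(dim=4, num_experts=3)
token = [0.5, -1.0, 0.25, 2.0]
mixed = layer.forward(token)
```

Because every expert contributes to every token, the cost grows linearly with the number of experts. In exchange, no routing decision has to be learned discretely, which is one plausible reason to prefer dense activation when the number of experts is small.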