CV | Dec 10, 2024

CoMA: Compositional Human Motion Generation with Multi-modal Agents

arXiv:2412.07320v2 | 17 citations | h-index: 13
Originality: Incremental advance
AI Analysis

The paper addresses limitations of motion generation in applications that require detailed human animation, though it appears incremental because it builds on existing agent-based and transformer-based approaches.

The paper tackles the problem of generating complex 3D human motions unseen in training data by introducing CoMA, an agent-based framework that combines collaborative agents powered by large language and vision models with a specialized motion generator. Results show competitive performance on HumanML3D, and user studies show that the method significantly outperforms existing approaches on new compositional prompts.

3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.
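To make the idea of body part-specific codebooks more concrete, the following is a minimal sketch of part-wise vector quantization. It is not the authors' implementation: the part grouping, feature dimensions, codebook size, random (rather than learned) codebooks, and all function names are assumptions chosen purely for illustration.

```python
import random

# Hypothetical body-part grouping: each part owns a slice of the per-frame
# feature vector. The parts and sizes are illustrative, not from the paper.
PART_SLICES = {
    "torso": slice(0, 4),
    "arms": slice(4, 8),
    "legs": slice(8, 12),
}
CODEBOOK_SIZE = 8  # illustrative value


def make_codebooks(seed=0):
    """Build one random codebook per body part (a stand-in for learned ones)."""
    rng = random.Random(seed)
    books = {}
    for part, sl in PART_SLICES.items():
        dim = sl.stop - sl.start
        books[part] = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(CODEBOOK_SIZE)
        ]
    return books


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def quantize_frame(frame, codebooks):
    """Map one frame's features to a dict of per-part token indices."""
    return {
        part: nearest_code(frame[sl], codebooks[part])
        for part, sl in PART_SLICES.items()
    }


def decode_frame(tokens, codebooks):
    """Rebuild a frame by concatenating the chosen code vector of each part."""
    frame = []
    for part in PART_SLICES:
        frame.extend(codebooks[part][tokens[part]])
    return frame


if __name__ == "__main__":
    books = make_codebooks()
    tokens = {"torso": 1, "arms": 2, "legs": 3}
    frame = decode_frame(tokens, books)
    # Edit only the arms token; torso and legs features stay untouched.
    edited = decode_frame(dict(tokens, arms=5), books)
    print(quantize_frame(frame, books))
    print(edited[0:4] == frame[0:4], edited[8:12] == frame[8:12])
```

Because each part has its own token stream in this sketch, swapping a single part's token changes only that part's features, which is one way to see why separate codebooks could support fine-grained, part-level editing.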

