Exploring Motion-Language Alignment for Text-driven Motion Generation
This work addresses the generation of realistic human motion from text descriptions, with applications in animation or robotics. It is an incremental improvement whose novelty lies in its attention regulation strategies.
The paper tackles the challenge of aligning motion dynamics with textual semantics in text-driven human motion generation. It proposes MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning, and reports consistent improvements in motion quality and motion-language alignment over baselines in experiments.
Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.
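To make the attention sink idea concrete, the following is a minimal sketch of one way an attention-concentration score could be computed. The abstract does not give the formula for SinkRatio, so the definition used here (the average share of each motion query's attention mass that falls on the start text token) is an assumption for illustration, and the function name `sink_ratio` and its parameters are hypothetical rather than taken from the authors' code.

```python
from typing import Sequence


def sink_ratio(attn: Sequence[Sequence[float]], sink_index: int = 0) -> float:
    """Average share of attention mass placed on one text token.

    ASSUMPTION: this is an illustrative definition, not the paper's
    SinkRatio formula, which the abstract does not specify.

    attn: one row per motion query (e.g. per frame or motion token);
          each row holds non-negative attention weights over text tokens.
    sink_index: position of the suspected sink token (0 = start token).
    """
    if not attn:
        raise ValueError("attn must contain at least one row")

    ratios = []
    for row in attn:
        total = sum(row)
        if total <= 0:
            raise ValueError("each attention row must have positive mass")
        # Normalize by the row total so unnormalized weights also work.
        ratios.append(row[sink_index] / total)

    # Mean over all motion queries.
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Two motion queries attending over four text tokens (made-up values).
    example = [
        [0.70, 0.10, 0.10, 0.10],
        [0.90, 0.05, 0.03, 0.02],
    ]
    print(f"sink ratio on start token: {sink_ratio(example):.3f}")
```

Under this assumed definition, a value near 1 means almost all attention lands on the start token and little is left for the descriptive words, which is the behavior the abstract describes as limiting semantic grounding. A value near 1/T for T text tokens would indicate roughly uniform attention.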