CV | Jul 9, 2025

MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction

Yin Wang, Mu Li, Zhiying Leng, Frederick W. B. Li, Xiaohui Liang

arXiv:2507.06590v1 | 8.45 citations | h-index: 5 | IEEE Trans Vis Comput Graph

Originality: Highly original

AI Analysis

This paper addresses motion generation from rare text prompts, a known bottleneck for applications such as animation and robotics, and proposes a novel method for it.

The paper tackles the challenge of generating human motion from rare language prompts by introducing MOST, a motion diffusion model that uses temporal clip Banzhaf interaction for fine-grained text-to-motion matching, achieving state-of-the-art performance in retrieval and generation.

We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST's retrieval stage presents the first formulation of its kind - temporal clip Banzhaf interaction - which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art text-to-motion retrieval and generation performance by comprehensively addressing previous challenges, as demonstrated through quantitative and qualitative results highlighting its effectiveness, especially for rare prompts.
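The abstract does not spell out how the temporal clip Banzhaf interaction is computed, so the following is only a minimal sketch of the classic pairwise Banzhaf interaction index from cooperative game theory, which the name suggests it builds on. For two players i and j, the index averages the quantity v(S ∪ {i, j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S) over every coalition S of the remaining players. The player names (`text`, `clip_a`, `clip_b`) and the value function `toy_value` are hypothetical stand-ins chosen for illustration; they are not MOST's actual text-motion similarity or clip representation.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Sequence


def banzhaf_interaction(
    players: Sequence[str],
    i: str,
    j: str,
    value: Callable[[FrozenSet[str]], float],
) -> float:
    """Classic pairwise Banzhaf interaction index of players i and j.

    Averages the joint marginal contribution of {i, j} over all
    coalitions S drawn from the remaining players.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += (
                value(s | {i, j})  # both players join the coalition
                - value(s | {i})   # only i joins
                - value(s | {j})   # only j joins
                + value(s)         # neither joins
            )
    # There are 2^(n-2) coalitions of the remaining players.
    return total / (2 ** len(others))


# Hypothetical toy game: one text prompt and two motion clips.
PLAYERS = ["text", "clip_a", "clip_b"]


def toy_value(coalition: FrozenSet[str]) -> float:
    """Assumed value function: the coalition scores 1 only when the
    text and clip_a appear together (clip_a 'matches' the text)."""
    return 1.0 if {"text", "clip_a"} <= coalition else 0.0


if __name__ == "__main__":
    print(banzhaf_interaction(PLAYERS, "text", "clip_a", toy_value))
    print(banzhaf_interaction(PLAYERS, "text", "clip_b", toy_value))
```

In this toy game the text interacts strongly with `clip_a` and not at all with `clip_b`, which is the kind of clip-level signal that could be used to keep matching clips and discard redundant ones. The exhaustive enumeration is exponential in the number of players, so a practical system would presumably need an approximation.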
