CV · Jul 9, 2025

MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction

arXiv:2507.06590v1 · 5 citations · h-index: 5 · IEEE Trans Vis Comput Graph
Originality: Highly original
AI Analysis

This paper addresses human motion generation from rare text prompts, with applications such as animation and robotics. It proposes a novel method for a known bottleneck.

The paper tackles the challenge of generating human motion from rare language prompts by introducing MOST, a motion diffusion model that uses temporal clip Banzhaf interaction for fine-grained text-to-motion matching, achieving state-of-the-art performance in retrieval and generation.

We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST's retrieval stage presents the first formulation of its kind - temporal clip Banzhaf interaction - which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art text-to-motion retrieval and generation performance by comprehensively addressing previous challenges, as demonstrated through quantitative and qualitative results highlighting its effectiveness, especially for rare prompts.
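To make the term concrete, here is a minimal sketch of the standard game-theoretic Banzhaf interaction index between two players, which the abstract's "temporal clip Banzhaf interaction" presumably builds on. This is not the paper's formulation: the player names (`t0` for a text token, `c0` and `c1` for motion clips) and the toy value function are hypothetical, chosen only to show how the index rewards a pair that contributes jointly rather than individually.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Standard Banzhaf interaction index between players i and j.

    Averages, over every coalition S of the remaining players, the joint
    contribution v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S).
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = frozenset(subset)
            total += (
                value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            )
    # There are 2^(n-2) coalitions of the remaining players.
    return total / (2 ** len(others))


# Hypothetical toy game: one text token and two motion clips.
players = ["t0", "c0", "c1"]


def toy_value(coalition):
    # Assumed score: the coalition is worth 1 only when the text token
    # and the matching clip c0 appear together.
    return 1.0 if {"t0", "c0"} <= coalition else 0.0


matched = banzhaf_interaction(players, toy_value, "t0", "c0")
unmatched = banzhaf_interaction(players, toy_value, "t0", "c1")
print(matched, unmatched)
```

In this toy game the token interacts strongly with the matching clip and not at all with the other one, which is the kind of clip-level signal that could be used to discard redundant clips. The exhaustive sum grows exponentially with the number of players, so a practical system would need sampling or a learned approximation.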

