CV · MM · Apr 17, 2024

Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion

arXiv:2404.11375v2 · 24 citations · h-index: 2 · IEEE Transactions on Image Processing
Originality: Incremental advance
AI Analysis

This work addresses a novel task in human motion understanding, enabling precise temporal grounding for applications like video analysis. It is rated incremental because it builds on existing text-motion methods, contributing a new dataset and an adapted model.

The paper tackles the problem of localizing temporal segments in untrimmed human motion sequences based on textual descriptions, introducing the Text-based Human Motion Grounding (THMG) task, and proposes TM-Mamba, a model that achieves effective performance on the new BABEL-Grounding dataset.

Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing, and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments corresponding to given textual descriptions within untrimmed motion sequences. Capturing global temporal information is crucial for the THMG task. However, Transformer-based models that rely on global temporal self-attention face challenges when handling long untrimmed sequences due to the quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost. The core of the model is a text-controlled selection mechanism which dynamically incorporates global temporal information based on the text query. The model is further enhanced to be topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.
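
To make the idea of a text-controlled selection mechanism with linear memory cost more concrete, the following is a minimal NumPy sketch of a selective state-space scan whose input, output, and step-size parameters are computed from both the current motion frame and a pooled text embedding. It is an illustration under stated assumptions, not the authors' TM-Mamba implementation: the function name, the parameter names (W_b, W_c, W_dt, a_log), the concatenation-based conditioning, and the single forward direction are all hypothetical, and the relational embeddings for graph topology are omitted.

```python
import numpy as np


def text_controlled_scan(motion, text, W_b, W_c, W_dt, a_log):
    """Hypothetical text-conditioned selective scan (illustrative only).

    motion: (T, D) per-frame motion features
    text:   (E,)   pooled text-query embedding
    W_b, W_c: (D + E, N) projections producing input/output selection vectors
    W_dt:     (D + E, D) projection producing per-channel step sizes
    a_log:    (D, N) log-magnitude of the (negative) state decay rates
    Returns:  (T, D) text-aware frame features
    """
    T, D = motion.shape
    N = a_log.shape[1]
    A = -np.exp(a_log)              # negative rates keep the recurrence stable
    h = np.zeros((D, N))            # hidden state: fixed size, independent of T
    out = np.empty((T, D))

    for t in range(T):
        # Selection parameters depend on the frame AND the text query,
        # so the query controls what is written to and read from the state.
        z = np.concatenate([motion[t], text])       # (D + E,)
        B_t = z @ W_b                               # (N,)
        C_t = z @ W_c                               # (N,)
        dt = np.logaddexp(0.0, z @ W_dt)            # softplus, (D,), positive

        decay = np.exp(dt[:, None] * A)             # (D, N), values in (0, 1)
        h = decay * h + (dt[:, None] * B_t[None, :]) * motion[t][:, None]
        out[t] = h @ C_t                            # (D,)

    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, E, N = 200, 6, 4, 8
    motion = rng.normal(size=(T, D))
    text = rng.normal(size=E)
    params = (
        0.1 * rng.normal(size=(D + E, N)),
        0.1 * rng.normal(size=(D + E, N)),
        0.1 * rng.normal(size=(D + E, D)),
        rng.normal(size=(D, N)),
    )
    features = text_controlled_scan(motion, text, *params)
    print(features.shape)
```

The point of the sketch is the cost profile: the recurrent state has a fixed size, so memory grows linearly with the sequence length T, whereas global self-attention materializes a T-by-T score matrix. A grounding head (not shown, and not specified in the abstract) would then map the per-frame features to segment boundaries.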

