LGSep 1, 2025

Hierarchical Motion Captioning Utilizing External Text Data Source

arXiv:2509.01471v1h-index: 5
Originality Incremental advance
AI Analysis

It enhances motion captioning for applications like robotics and animation by leveraging external text data, though it is incremental as it builds on existing methods.

This paper tackles the problem of motion captioning by addressing the lack of low-level descriptions in existing datasets, using a hierarchical approach that improves accuracy by 6% to 50% over state-of-the-art methods on three datasets.

This paper introduces a novel approach to enhance existing motion captioning methods, which directly map representations of movement to high-level descriptive captions (e.g., ``a person doing jumping jacks"). The existing methods require motion data annotated with high-level descriptions (e.g., ``jumping jacks"). However, such data is rarely available in existing motion-text datasets, which additionally do not include low-level motion descriptions. To address this, we propose a two-step hierarchical approach. First, we employ large language models to create detailed descriptions corresponding to each high-level caption that appears in the motion-text datasets (e.g., ``jumping while synchronizing arm extensions with the opening and closing of legs" for ``jumping jacks"). These refined annotations are used to retrain motion-to-text models to produce captions with low-level details. Second, we introduce a pioneering retrieval-based mechanism. It aligns the detailed low-level captions with candidate high-level captions from additional text data sources, and combine them with motion features to fabricate precise high-level captions. Our methodology is distinctive in its ability to harness knowledge from external text sources to greatly increase motion captioning accuracy, especially for movements not covered in existing motion-text datasets. Experiments on three distinct motion-text datasets (HumanML3D, KIT, and BOTH57M) demonstrate that our method achieves an improvement in average performance (across BLEU-1, BLEU-4, CIDEr, and ROUGE-L) ranging from 6% to 50% compared to the state-of-the-art M2T-Interpretable.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes