CV · Jul 31, 2025

Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

Shiyao Yu, Zi-An Wang, Kangning Yin, Zheng Tian, Mingyuan Zhang, Weixin Si, Shihao Zou

arXiv:2507.23188v1 · 3.6 · h-index: 3 · IEEE Transactions on Multimedia

Originality: Incremental advance

AI Analysis

This work addresses motion acquisition for applications like animation or gaming by enhancing retrieval precision and user interaction through multi-modal integration, though it is incremental as it builds on existing contrastive learning methods.

The paper tackles motion retrieval by proposing a framework that aligns text, audio, video, and motion in a fine-grained joint embedding space, reporting improvements such as a 10.16% increase in R@10 for text-to-motion retrieval and a 25.43% increase in R@1 for video-to-motion retrieval on the HumanML3D dataset.
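For readers unfamiliar with the R@1 and R@10 figures quoted above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. It is not taken from the paper: the function name `recall_at_k` is hypothetical, and the sketch assumes a square similarity matrix in which the correct gallery item for query i is item i.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    Assumption of this sketch: similarity[i][j] scores query i against
    gallery item j, and the correct match for query i is gallery item i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank gallery indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries scored against 3 motions.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct motion ranked first
    [0.8, 0.2, 0.5],  # query 1: correct motion ranked last
    [0.1, 0.4, 0.7],  # query 2: correct motion ranked first
]
r_at_1 = recall_at_k(sim, 1)
r_at_3 = recall_at_k(sim, 3)
```

In this toy case two of three queries retrieve their match at rank 1, and all do within the top 3. The summary does not say whether the reported gains are absolute or relative, so the sketch makes no claim either way.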

Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modalities. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representations of most modalities, which could improve retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
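The abstract does not spell out how the sequence-level contrastive objective is defined, so the following is only an illustrative sketch of one common way to build a fine-grained similarity, not the authors' method. It assumes each sample is a sequence of token embeddings, scores a pair by symmetric best-match cosine similarity between tokens (a late-interaction style score), and feeds the resulting matrix to a symmetric InfoNCE loss. The names `sequence_similarity` and `info_nce` and the temperature value are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def sequence_similarity(seq_a, seq_b):
    """Token-level similarity between two embedding sequences.

    Each token takes its best cosine match in the other sequence; the
    matches are averaged in both directions. (Assumed scoring rule.)
    """
    a = [normalize(t) for t in seq_a]
    b = [normalize(t) for t in seq_b]
    a_to_b = sum(max(dot(x, y) for y in b) for x in a) / len(a)
    b_to_a = sum(max(dot(x, y) for x in a) for y in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


def info_nce(sim_matrix, temperature=0.07):
    """Symmetric InfoNCE loss; positives lie on the diagonal."""
    n = len(sim_matrix)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_sum - logits[i]
        return total / n

    columns = [list(c) for c in zip(*sim_matrix)]
    return 0.5 * (one_direction(sim_matrix) + one_direction(columns))


# Toy batch: two text sequences and two motion sequences of 2-D tokens.
texts = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
motions = [[[1.0, 0.1], [0.8, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.2, 1.0]]]
sim = [[sequence_similarity(t, m) for m in motions] for t in texts]
loss_aligned = info_nce(sim)
loss_swapped = info_nce([row[::-1] for row in sim])  # mismatched pairing
```

Because the score is built from token-to-token matches rather than a single pooled vector per sample, sequences of different lengths can be compared directly, which is the property the abstract attributes to a fine-grained embedding space. In the toy batch the correctly paired arrangement yields a lower loss than the swapped one.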
