CV · IR · Mar 10

Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

arXiv:2603.09930v112.51 citations · h-index: 2 · Has Code
Predicted impact: top 50% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work improves retrieval accuracy and interpretability for researchers and practitioners in human motion analysis, though it is an incremental advance that builds on existing retrieval frameworks.

The paper tackles text-motion retrieval, addressing the loss of fine-grained local correspondences in existing dual-encoder methods. It proposes an interpretable joint-angle motion representation and an enhanced token-wise interaction, and reports state-of-the-art performance on the HumanML3D and KIT-ML datasets.

Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
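To make the token-wise late interaction concrete, here is a minimal sketch of a MaxSim score in the commonly used ColBERT-style form: each text token is matched to its most similar motion patch by cosine similarity, and the per-token maxima are summed. This is an illustration under assumptions, not the paper's implementation. The function names, the use of cosine similarity, and the sum aggregation are assumed here, and the Masked Language Modeling regularization is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def maxsim_score(text_tokens, motion_patches):
    """Hypothetical MaxSim scorer (ColBERT-style late interaction).

    text_tokens:    (n_tokens, d) array of text token embeddings
    motion_patches: (n_patches, d) array of motion patch embeddings
    Returns the scalar score and, for each token, the index of its best patch.
    """
    t = l2_normalize(np.asarray(text_tokens, dtype=float))
    p = l2_normalize(np.asarray(motion_patches, dtype=float))
    sim = t @ p.T                     # (n_tokens, n_patches) cosine similarities
    best_patch = sim.argmax(axis=1)   # which patch each token aligns to
    score = sim.max(axis=1).sum()     # sum of per-token maxima
    return float(score), best_patch


# Toy example: two text tokens, three motion patches, 2-D embeddings.
tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])
score, best_patch = maxsim_score(tokens, patches)
print(score, best_patch)
```

The `best_patch` indices are what makes this kind of scoring inspectable: each text token points at one motion patch, so a retrieval result can be traced back to token-patch pairs instead of a single global embedding similarity.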
