CV · Feb 4

DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

arXiv:2602.04188v1 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses motion generation and understanding for applications like animation and robotics, presenting a unified approach that is incremental over prior masked modeling methods.

The authors introduce DiMo, a discrete diffusion framework that unifies text-to-motion, motion-to-text, and motion-to-motion tasks in a single model. On the HumanML3D and KIT-ML datasets, they report strong motion quality and competitive bidirectional understanding.

Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate model ability in text-free motion completion, text-guided motion prediction and motion caption correction without architectural change. Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.
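To make the "iterative masked token refinement" idea concrete, here is a minimal sketch of how such a decoding loop is commonly structured. It is not DiMo's implementation: the abstract does not state the masking schedule or the rule for committing tokens, so the cosine schedule and confidence-based commit below are assumptions borrowed from generic masked-token decoders. `MASK`, `toy_predictor`, and `iterative_unmask` are hypothetical names, and the predictor is a random stub standing in for a trained model.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption, not from the paper)


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained denoiser (assumption).

    Returns one (token_id, confidence) pair per position. A real model would
    condition on the text prompt and on already-committed tokens; this stub
    simply samples at random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(seq_len, vocab_size, num_steps, seed=0):
    """Fill a fully masked sequence over at most `num_steps` refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Cosine schedule (assumed): target number of positions left masked
        # after this step; it reaches 0 at the final step.
        keep_masked = max(0, int(seq_len * math.cos(math.pi / 2 * step / num_steps)))
        num_to_fill = max(1, len(masked) - keep_masked)
        preds = toy_predictor(tokens, vocab_size, rng)
        # Commit the most confident predictions among the masked positions.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    # Fewer steps means fewer model calls (lower latency); more steps allow
    # finer-grained refinement.
    for steps in (1, 4, 10):
        print(steps, iterative_unmask(seq_len=16, vocab_size=32, num_steps=steps))
```

In this sketch, `num_steps` is the knob behind the quality-latency trade-off the abstract mentions: each step costs one predictor call, and the final step always commits whatever positions remain masked.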

