CV RO, Dec 27, 2025

Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

arXiv:2512.22464v1, h-index: 3

Originality: Incremental advance

AI Analysis

This work provides an incremental improvement for users in animation and virtual reality by enhancing interpretable motion control and editability.

The paper addresses generating and editing 3D motions from text. It targets a limitation of pose-code-based frameworks, which struggle to capture temporal dynamics and fine details, and reports improved Fréchet inception distance and reconstruction metrics on the HumanML3D and KIT-ML datasets compared with baselines such as CoMo.

Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
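
To make the tokenizer idea in the abstract more concrete, here is a minimal sketch of generic residual vector quantization with a simple form of residual dropout. It is not the authors' implementation. The function names, the toy codebooks, the use of a nearest-neighbour lookup for stage 0 (standing in for the pose codes), and the prefix-style dropout rule are all assumptions made for illustration.

```python
import random


def nearest(codebook, x):
    """Return (index, codeword) of the entry closest to x in squared L2 distance."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )
    return best, codebook[best]


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage quantizes what earlier stages left over."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, code)]
    return indices


def rvq_decode(indices, codebooks, keep_stages=None):
    """Sum the selected codewords from the first keep_stages stages (all if None)."""
    n = len(indices) if keep_stages is None else keep_stages
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in list(zip(indices, codebooks))[:n]:
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def sample_keep_stages(num_stages, drop_prob, rng):
    """Hypothetical residual dropout: with probability drop_prob, keep only a
    random prefix of stages (always at least stage 0, the coarse stage)."""
    if rng.random() < drop_prob:
        return rng.randint(1, num_stages)
    return num_stages


# Toy codebooks (assumed values): stage 0 is coarse, stage 1 holds small corrections.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
codes = rvq_encode([1.2, 0.8], codebooks)         # [1, 1]
full = rvq_decode(codes, codebooks)               # [1.25, 0.75]
coarse = rvq_decode(codes, codebooks, 1)          # [1.0, 1.0]
kept = sample_keep_stages(len(codebooks), 0.5, random.Random(0))
```

Decoding with only the first stage gives the coarse reconstruction, and adding later stages restores finer detail. Randomly truncating the stages during training is one plausible way to keep the coarse codes informative on their own, which is the role the abstract assigns to residual dropout.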
