CV · May 12

ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

arXiv:2605.11704 · 96.2
Predicted impact: top 7% in CV · last 90 days
Originality: Highly original
AI Analysis

This work improves text-to-motion generation for researchers and practitioners, offering a new coarse-to-fine paradigm with better fidelity and enabling training-free motion editing.

ScaleMoGen introduces a scale-wise autoregressive framework for text-driven human motion generation, achieving state-of-the-art FID of 0.030 on HumanML3D and CLIP Score of 0.693 on SnapMoGen, outperforming prior methods like MoMask and MoMask++.

We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.
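To make the coarse-to-fine idea concrete, here is a minimal sketch of multi-scale residual quantization with sign-based (bitwise) codes over the temporal axis only. It is an illustration under stated assumptions, not the authors' tokenizer: the helper names (`pool`, `upsample`, `encode`), the scale schedule `(1, 2, 4, 8)`, the per-scale magnitude `alpha`, and the toy sine-wave "motion" are all hypothetical, and the skeletal hierarchy the abstract describes is not modeled.

```python
import math

def pool(seq, s):
    """Average-pool a T x D sequence into s equal temporal blocks."""
    T, D = len(seq), len(seq[0])
    n = T // s  # frames per block; this sketch assumes T is divisible by s
    return [[sum(seq[b * n + i][d] for i in range(n)) / n for d in range(D)]
            for b in range(s)]

def upsample(blocks, T):
    """Nearest-neighbour upsample s blocks back to T frames."""
    n = T // len(blocks)
    return [list(blocks[t // n]) for t in range(T)]

def encode(motion, scales=(1, 2, 4, 8)):
    """Quantize a T x D feature sequence into one bit map per scale.

    Each scale encodes the residual left by the coarser scales, so finer
    scales only add detail on top of what is already reconstructed.
    """
    T, D = len(motion), len(motion[0])
    residual = [list(frame) for frame in motion]
    recon = [[0.0] * D for _ in range(T)]
    bit_maps, errors = [], []
    for s in scales:
        pooled = pool(residual, s)
        # Assumption: one magnitude per scale, set to the mean absolute value.
        alpha = sum(abs(v) for row in pooled for v in row) / (s * D)
        # Bitwise code: one bit per channel, taken from the sign.
        bits = [[1 if v >= 0 else 0 for v in row] for row in pooled]
        quant = [[alpha * (2 * b - 1) for b in row] for row in bits]
        up = upsample(quant, T)
        for t in range(T):
            for d in range(D):
                recon[t][d] += up[t][d]
                residual[t][d] -= up[t][d]
        bit_maps.append(bits)
        errors.append(math.sqrt(sum(v * v for row in residual for v in row)))
    return bit_maps, recon, errors

# Toy "motion": 8 frames, 3 feature channels (hypothetical data).
motion = [[math.sin(0.5 * t + d) for d in range(3)] for t in range(8)]
bit_maps, recon, errors = encode(motion)
print([len(m) for m in bit_maps])  # token-map length per scale
print([round(e, 3) for e in errors])  # residual norm after each scale
```

In this sketch, a next-scale predictor would generate `bit_maps[k]` conditioned on `bit_maps[:k]` and the text prompt, rather than emitting tokens one frame at a time. With `alpha` chosen as the mean absolute pooled value, the residual norm cannot grow from one scale to the next.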

