CV · Apr 4

Next-Scale Autoregressive Models for Text-to-Motion Generation

arXiv:2604.03799 · 80.1
Predicted impact: top 28% in CV · last 90 days
Originality: Highly original
AI Analysis

This work addresses the misalignment between standard next-token prediction and temporal structure in text-conditioned motion generation, offering a more efficient and scalable approach for the computer vision and graphics community.

MoScale introduces a next-scale autoregressive framework that generates motion hierarchically from coarse to fine temporal resolutions, achieving state-of-the-art text-to-motion performance with high training efficiency and zero-shot generalization to diverse tasks.

Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves SOTA text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
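The coarse-to-fine idea in the abstract can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not MoScale's implementation: the names `upsample_nearest`, `toy_predictor`, and `generate_next_scale`, the scale schedule `(4, 8, 16, 32)`, and the use of continuous features rather than learned tokens are all hypothetical, and the cross-scale and in-scale refinement steps are omitted. A random function stands in for the learned model.

```python
import numpy as np

def upsample_nearest(x, length):
    """Stretch a (T, D) sequence to `length` frames by nearest-neighbour repetition."""
    idx = np.floor(np.arange(length) * x.shape[0] / length).astype(int)
    return x[idx]

def toy_predictor(context, text_emb, rng):
    """Hypothetical stand-in for a learned model.

    Returns a small residual for the current scale, conditioned on the
    upsampled coarser motion (`context`) and a text embedding.
    """
    return 0.1 * rng.standard_normal(context.shape) + 0.01 * text_emb

def generate_next_scale(text_emb, scales=(4, 8, 16, 32), seed=0):
    """Generate motion features scale by scale, from coarse to fine.

    Each scale sees only the coarser scales produced before it, which is
    the causal ordering across scales (assumed here for illustration).
    """
    rng = np.random.default_rng(seed)
    dim = text_emb.shape[0]
    motion = np.zeros((scales[0], dim))  # empty start before the coarsest scale
    history = []
    for length in scales:
        context = upsample_nearest(motion, length)  # carry coarse structure forward
        motion = context + toy_predictor(context, text_emb, rng)  # refine at this scale
        history.append(motion)
    return motion, history

if __name__ == "__main__":
    final, per_scale = generate_next_scale(np.ones(6))
    print(final.shape, [m.shape[0] for m in per_scale])
```

In this sketch the coarsest scale fixes the overall shape of the sequence and each later scale adds detail at a higher temporal resolution, which loosely mirrors the "global semantics first, progressive refinement after" description above.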

