CV · AI · Mar 18, 2025

Fast Autoregressive Video Generation with Diagonal Decoding

arXiv:2503.14070v1 · 6 citations · h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses slow inference for researchers and practitioners who use autoregressive video generation models. The advance is incremental because it accelerates existing pre-trained models rather than introducing a new one.

The paper tackles the bottleneck of slow sequential decoding in autoregressive Transformer models for video generation. It proposes Diagonal Decoding (DiagD), a training-free inference acceleration algorithm that exploits spatial and temporal correlations in videos, and reports up to 10x speedup over naive sequential decoding while maintaining comparable visual fidelity.

Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
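The diagonal decoding order described in the abstract can be pictured with a small scheduling sketch. The abstract does not spell out the exact schedule, so everything below is an illustrative assumption rather than the authors' implementation: tokens in an H×W frame are grouped by anti-diagonal index (row + column), all tokens on one diagonal share a decoding step, and a hypothetical `frame_delay` parameter controls how early the next frame may start. No model is called here; the sketch only counts decoding steps.

```python
from collections import defaultdict


def diagonal_schedule(num_frames, height, width, frame_delay=None):
    """Group (frame, row, col) token positions into parallel decoding steps.

    Assumption (not from the paper): a token at (t, r, c) is decoded at
    step t * frame_delay + r + c. `frame_delay` is a hypothetical knob;
    the default lets each frame finish before the next one starts.
    """
    if frame_delay is None:
        frame_delay = height + width - 1  # diagonals in one frame, no overlap
    if frame_delay < 1:
        raise ValueError("frame_delay must be at least 1")

    steps = defaultdict(list)
    for t in range(num_frames):
        for r in range(height):
            for c in range(width):
                steps[t * frame_delay + r + c].append((t, r, c))

    # Tokens inside one inner list would be decoded in the same forward pass.
    return [steps[s] for s in sorted(steps)]


if __name__ == "__main__":
    frames, height, width = 2, 3, 4
    no_overlap = diagonal_schedule(frames, height, width)
    overlapped = diagonal_schedule(frames, height, width, frame_delay=3)
    print("sequential steps:", frames * height * width)
    print("diagonal steps, frames decoded one after another:", len(no_overlap))
    print("diagonal steps, frames partially overlapped:", len(overlapped))
```

For two frames of 3×4 tokens, sequential decoding needs 24 steps, this schedule needs 12 when frames are decoded one after another, and 9 when the second frame starts three steps after the first. A smaller `frame_delay` means fewer steps but less context available for each token, which is one way to picture the speed versus quality trade-off the abstract mentions.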

