CV · Jan 21, 2025

Taming Teacher Forcing for Masked Autoregressive Video Generation

Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum

Tsinghua

arXiv:2501.12389v126.126 citations · h-index: 32 · CVPR

Originality: Incremental advance

AI Analysis

This work addresses scalable, high-quality video generation for applications like media production, though it appears incremental as it builds on existing autoregressive and masked modeling techniques.

The paper tackles the problem of generating long, coherent video sequences by introducing MAGI, a hybrid framework that combines masked modeling within each frame with causal modeling across frames. Its Complete Teacher Forcing (CTF) scheme improves FVD by 23% over Masked Teacher Forcing (MTF) on first-frame conditioned video prediction.

We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
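The difference between MTF and CTF described in the abstract can be illustrated with a small sketch of how the training context for each frame might be assembled. This is not the authors' implementation: the token layout (a video as a list of per-frame token lists), the `MASK` id, the masking ratio, and the names `mask_frame` and `build_training_pairs` are all assumptions made for illustration. It also assumes that under MTF every earlier frame, including the first, is masked.

```python
import random

MASK = -1  # hypothetical id standing in for the mask token


def mask_frame(frame, ratio, rng):
    """Replace a random subset of a frame's tokens with MASK."""
    n_masked = int(round(ratio * len(frame)))
    masked_idx = set(rng.sample(range(len(frame)), n_masked))
    return [MASK if i in masked_idx else tok for i, tok in enumerate(frame)]


def build_training_pairs(video, ratio, mode, seed=0):
    """For each frame t >= 1, return (context, masked_input, clean_target).

    mode="mtf": the context is the masked version of the earlier frames.
    mode="ctf": the context is the complete (unmasked) earlier frames.
    In both modes the frame being predicted is masked in the same way.
    """
    if mode not in ("mtf", "ctf"):
        raise ValueError("mode must be 'mtf' or 'ctf'")
    rng = random.Random(seed)
    masked = [mask_frame(frame, ratio, rng) for frame in video]
    pairs = []
    for t in range(1, len(video)):
        if mode == "mtf":
            context = masked[:t]
        else:
            context = [list(frame) for frame in video[:t]]
        pairs.append((context, masked[t], list(video[t])))
    return pairs


# Toy video: 3 frames of 4 tokens each.
video = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
ctf_pairs = build_training_pairs(video, ratio=0.5, mode="ctf")
mtf_pairs = build_training_pairs(video, ratio=0.5, mode="mtf")
```

In this sketch the two modes differ only in the context: under `"ctf"` the context frames contain no `MASK` entries, which matches what the model sees at inference time when it conditions on frames it has already fully generated.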
