CV · Jan 21, 2025

Taming Teacher Forcing for Masked Autoregressive Video Generation

Tsinghua
arXiv:2501.12389v1 · 25 citations · h-index: 32 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses scalable, high-quality video generation for applications like media production, though it appears incremental as it builds on existing autoregressive and masked modeling techniques.

The paper tackles the problem of generating long, coherent video sequences by introducing MAGI, a hybrid framework that combines masked and causal modeling. Its Complete Teacher Forcing (CTF) scheme yields a 23% improvement in FVD over Masked Teacher Forcing (MTF) on first-frame conditioned video prediction.

We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
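The difference between the two teacher-forcing schemes can be illustrated with a toy sketch. This is not the authors' implementation: the token layout, the `MASK` placeholder, the helper names `mask_frame` and `build_training_views`, and the uniform random masking are all assumptions made for illustration. The only thing it shows is what the model would condition on when predicting frame t: masked copies of the earlier frames under MTF, complete copies under CTF.

```python
import random

MASK = -1  # hypothetical placeholder id for a masked token (assumption)


def mask_frame(frame, ratio, rng):
    """Replace a random subset of one frame's tokens with MASK."""
    n_mask = int(round(ratio * len(frame)))
    hidden = set(rng.sample(range(len(frame)), n_mask))
    return [MASK if i in hidden else tok for i, tok in enumerate(frame)]


def build_training_views(video, ratio, mode, seed=0):
    """For each frame t >= 1, return (context_frames, masked_frame_t).

    mode="mtf": the context is the masked version of frames 0..t-1.
    mode="ctf": the context is the complete version of frames 0..t-1.
    In both modes the frame being predicted is masked.
    """
    if mode not in ("mtf", "ctf"):
        raise ValueError("mode must be 'mtf' or 'ctf'")
    rng = random.Random(seed)
    masked = [mask_frame(f, ratio, rng) for f in video]
    source = video if mode == "ctf" else masked
    views = []
    for t in range(1, len(video)):
        context = [list(f) for f in source[:t]]  # causal: earlier frames only
        views.append((context, masked[t]))
    return views


# Toy video: 3 frames of 4 tokens each, with non-negative token ids.
video = [[10 * t + i for i in range(4)] for t in range(3)]
ctf_views = build_training_views(video, ratio=0.5, mode="ctf")
mtf_views = build_training_views(video, ratio=0.5, mode="mtf")
```

Under these assumptions, the CTF context at training time has the same form as the fully generated frames the model sees at inference, whereas the MTF context contains masked tokens that never appear in completed frames.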

