CV · Dec 10, 2022

MAGVIT: Masked Generative Video Transformer

CMU, DeepMind
arXiv:2212.05199v2 · 383 citations · h-index: 70
Originality: Highly original
AI Analysis

MAGVIT addresses efficient, high-quality video generation for AI and multimedia applications. The analysis rates it a significant advance rather than an incremental improvement.

The paper tackles video synthesis by introducing MAGVIT, a masked generative video transformer that combines a 3D tokenizer with masked token modeling so that a single model handles multiple tasks. It reports the best-published FVD on three video generation benchmarks, including Kinetics-600, with inference about two orders of magnitude faster than diffusion models and 60x faster than autoregressive models.

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
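To make the two ideas in the abstract concrete (a 3D tokenizer that produces a spatial-temporal token grid, and masked token modeling over that grid), here is a minimal Python sketch. It is not the authors' implementation. The strides, the codebook size, the `MASK_ID` sentinel, the function names, and the cosine masking schedule are all illustrative assumptions that do not come from the abstract.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel standing in for a [MASK] token


def token_grid_shape(frames, height, width, t_stride, s_stride):
    """Token grid shape, assuming the tokenizer downsamples time by
    t_stride and each spatial axis by s_stride (assumed values)."""
    return (frames // t_stride, height // s_stride, width // s_stride)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns the masked sequence and the set of masked positions; a model
    would be trained to predict the original tokens at those positions."""
    n_mask = round(mask_ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions


def cosine_mask_schedule(num_tokens, steps):
    """Tokens still masked after each parallel decoding step.

    The cosine shape is an assumption borrowed from masked image
    modeling work; the abstract does not specify a schedule."""
    return [
        math.floor(num_tokens * math.cos(math.pi / 2 * (s + 1) / steps))
        for s in range(steps)
    ]


# Illustrative numbers only: a 16-frame 128x128 clip with assumed strides.
shape = token_grid_shape(16, 128, 128, t_stride=4, s_stride=8)
num_tokens = shape[0] * shape[1] * shape[2]

rng = random.Random(0)
tokens = [rng.randrange(1024) for _ in range(num_tokens)]  # assumed codebook size
masked, positions = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
schedule = cosine_mask_schedule(num_tokens, steps=12)
```

Because every masked position can be predicted in parallel, a decoder of this kind needs only a small, fixed number of steps rather than one step per token, which is the intuition behind the speedup the abstract reports over autoregressive models.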

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
