CV · LG · Dec 12, 2023

PEEKABOO: Interactive Video Generation via Masked-Diffusion

Georgia Tech, Microsoft
arXiv:2312.07509v2 · 76 citations · h-index: 34 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the limited user interactivity of video generation models in creative applications. The method is an incremental advance, but one with practical impact.

The paper tackles the lack of interactive control in video generation models by introducing Peekaboo, a masked attention module that adds spatio-temporal control without extra training or inference overhead. It achieves up to a 3.8x improvement in mIoU over baselines at the same latency.
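For readers unfamiliar with the metric, mIoU (mean intersection over union) can be sketched as below. This is a generic illustration, assuming one binary mask per frame for both the requested region and the region where the object was detected; the function name `mean_iou` and the choice to skip frames with an empty union are assumptions made here, not the benchmark's actual evaluation protocol.

```python
import numpy as np

def mean_iou(pred_masks, target_masks):
    """Mean intersection-over-union across frames.

    pred_masks, target_masks: arrays of shape (frames, height, width),
    truthy where a pixel belongs to the object region.
    Frames where both masks are empty are skipped (an assumption of this sketch).
    """
    pred = np.asarray(pred_masks, dtype=bool)
    target = np.asarray(target_masks, dtype=bool)
    axes = tuple(range(1, pred.ndim))  # reduce over every axis except frames
    inter = np.logical_and(pred, target).sum(axis=axes)
    union = np.logical_or(pred, target).sum(axis=axes)
    valid = union > 0
    if not valid.any():
        return 0.0
    return float((inter[valid] / union[valid]).mean())
```

A higher value means the generated object stays closer to the region the user asked for.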

Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However, a significant limitation is their inability to offer interactive control to users, a feature that promises to open up unprecedented applications and creativity. In this work, we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present Peekaboo, a novel masked attention module, which seamlessly integrates with current video generation models offering control without the need for additional training or inference overhead. To facilitate future research, we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models, all while maintaining the same latency. Code and benchmark are available on the webpage.
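The general idea of masked attention can be sketched in a few lines. The example below is a generic NumPy illustration, not the paper's implementation: the helper names `masked_attention` and `region_mask`, and the rule that tokens attend only within their own region, are assumptions chosen to show how a user-supplied region can constrain attention without any retraining.

```python
import numpy as np

def masked_attention(q, k, v, allow):
    """Scaled dot-product attention restricted by a boolean mask.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v)
    allow: (n_q, n_k) bool, True where a query may attend to a key.
    Every query row must allow at least one key.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allow, scores, -np.inf)  # blocked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def region_mask(fg):
    """Build an attention mask from per-token region flags.

    fg: (n,) bool, True for tokens inside the user-specified region.
    Tokens may attend only to tokens in the same region
    (foreground to foreground, background to background).
    """
    fg = np.asarray(fg, dtype=bool)
    return fg[:, None] == fg[None, :]
```

Because the mask only changes which attention weights are zeroed, a sketch like this adds no learned parameters, which is consistent with the abstract's claim of control without additional training.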

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
