CV | AI | Jan 13, 2025

BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

arXiv:2501.07647v1 | 15 citations | h-index: 26 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses the need for finer controllability in video generation, for users who require precise control over object motion and appearance. It is rated incremental because it builds on existing diffusion models.

The paper addresses the difficulty text-to-video models have with complex prompts and multiple objects. It proposes a blob video representation for improved controllability and reports superior zero-shot generation and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, its compositional accuracy surpasses that of proprietary text-to-video generators.

Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives - blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models. Extensive experimental results show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.
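
The abstract does not say how the masked 3D attention module builds its mask, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes every spatio-temporal token carries a hypothetical region label (a blob id, or -1 for background) and that a token may attend only to tokens with the same label, in any frame. The names `build_region_mask` and `masked_attention` are invented for this example.

```python
import math


def build_region_mask(region_ids):
    """Return mask[i][j] = True when token i may attend to token j.

    region_ids holds one label per token, flattened over (frame, position).
    Assumption for this sketch: tokens attend only within the same region.
    """
    n = len(region_ids)
    return [[region_ids[i] == region_ids[j] for j in range(n)] for i in range(n)]


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention that ignores masked-out token pairs."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        # Masked pairs get a score of -inf, so their softmax weight is exactly 0.
        scores = [
            scale * sum(a * b for a, b in zip(q, k)) if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        peak = max(scores)  # finite, because every token can attend to itself
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Two frames with two tokens each: tokens 0 and 2 belong to blob 0,
# tokens 1 and 3 belong to blob 1 (labels are made up for illustration).
region_ids = [0, 1, 0, 1]
mask = build_region_mask(region_ids)
features = [[0.5, 0.1], [0.2, 0.3], [0.4, 0.9], [0.7, 0.6]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
attended = masked_attention(features, features, values, mask)
```

In this toy setup each output row mixes values only from tokens of the same blob across both frames, which is one plausible way a region mask could encourage per-object consistency over time.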

