CV
Feb 17

Consistency-Preserving Diverse Video Generation

arXiv:2602.15287v1
h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficient, high-quality video generation for applications that require varied outputs. The advance is incremental because it builds on existing flow-matching models.

The paper tackles the problem of generating diverse videos from text while preserving temporal consistency. It proposes a joint-sampling framework that, in experiments, achieves diversity comparable to baselines while improving temporal consistency and color naturalness.

Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.
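
One way to read the step that "removes only the components that would decrease a temporal-consistency objective" is as a gradient projection. The sketch below illustrates that reading only; it is not the paper's implementation. The function name `remove_conflicting_component`, the flat-list representation of a latent, and the single-gradient projection rule are all assumptions made for illustration.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_conflicting_component(
    diversity_update: List[float],
    consistency_grad: List[float],
) -> List[float]:
    """Hypothetical projection step (an assumption, not the paper's code).

    diversity_update: proposed latent update that increases batch diversity.
    consistency_grad: gradient of a temporal-consistency objective
                      (higher is better) with respect to the same latent.
    """
    inner = dot(diversity_update, consistency_grad)
    # A non-negative inner product means the update does not decrease
    # consistency to first order, so it is kept unchanged.
    if inner >= 0:
        return list(diversity_update)
    # Otherwise subtract the component along the consistency gradient,
    # leaving an update orthogonal to it.
    scale = inner / dot(consistency_grad, consistency_grad)
    return [d - scale * g for d, g in zip(diversity_update, consistency_grad)]


if __name__ == "__main__":
    conflicting = remove_conflicting_component([1.0, -2.0], [0.0, 1.0])
    print(conflicting)
    print(dot(conflicting, [0.0, 1.0]))
```

Under this reading, an update that already agrees with the consistency gradient passes through untouched, so diversity is reduced only where the two objectives conflict. In the abstract's setting, both the update and the gradient would come from lightweight latent-space models rather than from decoded video frames.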

