OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
This work addresses the need for flexible, scalable tools for video generation and understanding in applications such as video-to-video translation and scene reconstruction, offering a novel method for a known bottleneck.
The paper tackles the problem of controllable video diffusion by proposing OmniVDiff, a framework that synthesizes and comprehends multiple video visual modalities in a single model, achieving state-of-the-art performance in video generation and competitive results in video understanding.
In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual modalities in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. Our framework supports three key capabilities: (1) text-conditioned video generation, where all modalities are jointly synthesized from a textual prompt; (2) video understanding, where structural modalities are predicted from RGB inputs in a coherent manner; and (3) X-conditioned video generation, where video synthesis is guided by fine-grained inputs such as depth, Canny edges, and segmentation. Extensive experiments demonstrate that OmniVDiff achieves state-of-the-art performance in video generation tasks and competitive results in video understanding. Its flexibility and scalability make it well-suited for downstream applications such as video-to-video translation, modality adaptation for visual tasks, and scene reconstruction.
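The adaptive control strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation. It only shows one way the three capabilities could reduce to a per-modality role assignment, where conditioning modalities are passed through clean and generation modalities are noised. The function names (`assign_roles`, `build_model_input`), the role labels, the list-of-floats stand-in for a video latent, and the linear noising rule are all assumptions made for illustration.

```python
import random

# Modalities named in the abstract; the role labels are illustrative names.
MODALITIES = ("rgb", "depth", "canny", "segmentation")
GENERATE, CONDITION = "generate", "condition"


def assign_roles(condition_on=()):
    """Map each modality to a role: conditioning if supplied, otherwise generated."""
    unknown = set(condition_on) - set(MODALITIES)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    return {m: (CONDITION if m in condition_on else GENERATE) for m in MODALITIES}


def build_model_input(clean, roles, noise_level, rng):
    """Noise only the modalities being generated; keep conditioning ones clean.

    `clean` maps modality -> list of floats (a toy stand-in for a video latent).
    The linear blend below is a placeholder, not the paper's noise schedule.
    """
    out = {}
    for name, values in clean.items():
        if roles[name] == CONDITION:
            out[name] = list(values)  # conditioning signal stays untouched
        else:
            out[name] = [
                (1.0 - noise_level) * v + noise_level * rng.gauss(0.0, 1.0)
                for v in values
            ]
    return out


# The three capabilities from the abstract, expressed as role assignments.
text_to_video = assign_roles()                # every modality is generated
video_understanding = assign_roles(["rgb"])   # rgb conditions, the rest are predicted
depth_conditioned = assign_roles(["depth"])   # depth conditions, the rest are generated
```

Under these assumptions, switching between tasks changes only the role dictionary, while the same joint model would consume the output of `build_model_input` in every case.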