CV · May 17

Image-to-Video Diffusion: From Foundations to Open Frontiers

arXiv:2605.17248 · 98.4
Predicted impact top 3% in CV · last 90 days
Originality Synthesis-oriented
AI Analysis

For researchers in generative models, this survey fills a gap by offering a structured overview of I2V generation, which is increasingly important for applications requiring content consistency and motion coherence.

This survey provides a dedicated taxonomy and systematic analysis of diffusion-based image-to-video (I2V) generation, reviewing task formulation, architectures, datasets, metrics, and core designs, while identifying open challenges.

Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature grows rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, and discusses representative application scenarios together with major open challenges.

