CV · Nov 24, 2025

Are Image-to-Video Models Good Zero-Shot Image Editors?

arXiv:2511.19435v2 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses efficient and effective image editing by repurposing existing video models. It is incremental in that it builds on pretrained models, adding novel adaptations.

The paper tackles zero-shot image editing with image-to-video diffusion models by introducing IF-Edit, a tuning-free framework that addresses prompt misalignment, redundant temporal latents, and blurry late-stage frames. Across four benchmarks, it performs strongly on reasoning-centric tasks and remains competitive on general-purpose edits.

Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
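
The abstract does not spell out how temporal latent dropout is implemented, so the following is only a minimal sketch of the general idea: partway through denoising, the sequence of frame latents is thinned so that later steps process fewer frames. Everything here is an assumption made for illustration, including the function names `temporal_latent_dropout` and `denoise`, the `keep_every` and `switch_step` parameters, the choice to always keep the first and last frames, and the use of plain Python lists in place of real latent tensors.

```python
def temporal_latent_dropout(frame_latents, keep_every=2):
    """Thin a sequence of per-frame latents along the time axis.

    Hypothetical rule (not taken from the paper): keep every
    `keep_every`-th frame, and always keep the first and last frames.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    n = len(frame_latents)
    if n <= 2:
        return list(frame_latents)
    kept = set(range(0, n, keep_every))
    kept.add(n - 1)  # never drop the final frame
    return [frame_latents[i] for i in sorted(kept)]


def denoise(frame_latents, num_steps, switch_step, denoise_step, keep_every=2):
    """Toy denoising loop that applies the dropout once, at `switch_step`.

    `denoise_step(latent, step)` is a caller-supplied stand-in for one
    model update; `switch_step` stands in for the expert-switch point.
    """
    for step in range(num_steps):
        if step == switch_step:
            frame_latents = temporal_latent_dropout(frame_latents, keep_every)
        frame_latents = [denoise_step(z, step) for z in frame_latents]
    return frame_latents


if __name__ == "__main__":
    # Integers stand in for latent tensors; the identity stands in for the model.
    frames = list(range(9))
    print(denoise(frames, num_steps=4, switch_step=2, denoise_step=lambda z, s: z))
```

In a real video diffusion model the remaining latents would likely also need consistent temporal position information after frames are dropped; this sketch ignores that and only shows where the compression sits in the loop.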
