Video4Edit: Viewing Image Editing as a Degenerate Temporal Process
The approach reduces the data requirements of image editing models, which benefits computer vision researchers and developers, although the contribution is incremental because it builds on existing temporal modeling ideas.
The paper tackles the high cost of instruction-driven image editing by viewing it as a degenerate temporal process, enabling transfer of priors from video pre-training; it matches leading open-source baselines while using only about 1% of the supervision required by mainstream models.
We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.
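One way to picture the "degenerate temporal process" framing is to pack an editing triplet as a two-frame clip, with the source image as frame 0 and the edited image as frame 1. The sketch below is only an illustration of that reading, not the paper's implementation: the abstract does not describe the data layout or model interface, and the names `Clip` and `edit_as_clip`, as well as the tiny nested-list images, are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumption: a tiny H x W grayscale image as nested lists, for illustration only.
Frame = List[List[float]]


@dataclass
class Clip:
    """A sequence of frames, optionally conditioned on a text instruction."""
    frames: List[Frame]
    instruction: str = ""

    @property
    def num_frames(self) -> int:
        return len(self.frames)


def edit_as_clip(source: Frame, edited: Frame, instruction: str) -> Clip:
    """Pack an editing triplet as a two-frame clip (hypothetical helper).

    Frame 0 is the source image and frame 1 is the edited image, so the
    edit is treated as a temporal process with a single transition.
    """
    same_rows = len(source) == len(edited)
    same_cols = all(len(a) == len(b) for a, b in zip(source, edited))
    if not (same_rows and same_cols):
        raise ValueError("source and edited images must share the same shape")
    return Clip(frames=[source, edited], instruction=instruction)


source = [[0.0, 0.0], [0.0, 0.0]]
edited = [[1.0, 0.0], [0.0, 0.0]]
clip = edit_as_clip(source, edited, "brighten the top-left pixel")
print(clip.num_frames)  # 2
```

Under this reading, a model that already handles multi-frame clips could in principle consume editing data without a separate input format, which is the intuition behind reusing video pre-training priors.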