CV LG · May 27, 2025

Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training

Bolin Lai, Sangmin Lee, Xu Cao, Xiang Li, James M. Rehg

arXiv:2505.20629v16.21 citations · h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses the need for flexible, resource-efficient controllable video generation in AI and creative applications, though the advance is incremental because it builds on existing T2V models.

The paper tackles text-image-to-video generation by proposing FlexTI2V, a training-free method that conditions text-to-video models on arbitrary images at arbitrary positions and surpasses previous training-free methods by a notable margin.

Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few predefined conditioning settings. To tackle this issue, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to a noisy representation in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. We also provide further insights into our method through a detailed ablation study and analysis.
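
The abstract does not spell out how the patch swapping or the per-frame control works, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes the video latents have shape `(frames, channels, height, width)`, that the condition image has already been inverted to a latent at the same noise level, and that the swap fraction decays linearly with temporal distance from the conditioned frame. The function names (`swap_fraction`, `random_patch_swap`) and the parameters `patch` and `max_frac` are hypothetical and chosen purely for illustration.

```python
import numpy as np


def swap_fraction(frame_idx, cond_idx, num_frames, max_frac=0.8):
    """Hypothetical schedule: the fraction of swapped patches decays
    linearly with the temporal distance from the conditioned frame."""
    dist = abs(frame_idx - cond_idx) / max(num_frames - 1, 1)
    return max_frac * (1.0 - dist)


def random_patch_swap(video, cond, cond_idx, patch=2, max_frac=0.8, rng=None):
    """Copy randomly chosen patches of the condition latent into each frame.

    video: (F, C, H, W) noisy video latents at the current denoising step.
    cond:  (C, H, W) inverted condition-image latent (assumed same noise level).
    Returns a new array; the input video is left unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames, channels, height, width = video.shape
    assert cond.shape == (channels, height, width)
    assert height % patch == 0 and width % patch == 0

    grid_h, grid_w = height // patch, width // patch
    out = video.copy()
    for f in range(num_frames):
        frac = swap_fraction(f, cond_idx, num_frames, max_frac)
        # One Bernoulli draw per patch, then expand to pixel resolution.
        grid = rng.random((grid_h, grid_w)) < frac
        mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
        # Broadcast the (H, W) mask over channels.
        out[f] = np.where(mask[None], cond, video[f])
    return out


# Toy usage: 4 frames of zeros, condition latent of ones at frame 0.
video = np.zeros((4, 3, 8, 8))
cond = np.ones((3, 8, 8))
mixed = random_patch_swap(video, cond, cond_idx=0, max_frac=1.0,
                          rng=np.random.default_rng(0))
```

In this toy setup the conditioned frame takes every patch from the image latent and the most distant frame takes none, which mirrors the abstract's stated goal of trading off fidelity near the condition against creativity elsewhere. A real pipeline would apply such a step inside the T2V denoising loop, and the actual schedule used by FlexTI2V may differ.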
