CV AI Nov 21, 2025

Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

arXiv:2511.17844v3

Originality: Highly original

AI Analysis

This paper addresses the challenge of data acquisition for controllable text-to-video generation, offering researchers and practitioners in video synthesis a more data-efficient approach.

The paper tackles the problem of fine-tuning text-to-video diffusion models to add new generative controls, such as physical camera parameters, which typically requires large, high-quality datasets. It proposes a data-efficient strategy that uses sparse, low-quality synthetic data, shows that this yields better results than fine-tuning on photorealistic data, and provides a framework that justifies the phenomenon.

Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
