Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
Any2Caption addresses the problem of accurately interpreting user intent in video generation. Aimed at researchers and practitioners, it offers a method that improves controllability, though the contribution is incremental in that it builds on existing multimodal models and video generators.
The paper tackles the bottleneck of interpreting user intent in controllable video generation by introducing Any2Caption, a framework that decouples condition interpretation from video synthesis. It uses multimodal large language models to convert diverse inputs into structured captions, and it reports significant improvements in controllability and video quality across existing video generation models.
To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple the interpretation of the various conditions from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that give backbone video generators better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models. Project Page: https://sqwu.top/Any2Cap/
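The decoupling described in the abstract can be pictured as a two-stage pipeline: an interpreter turns heterogeneous conditions into a structured caption, and the video generator then consumes only that caption. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `Condition`, `interpret_conditions`, `render_caption`, and `generate_video`, as well as the caption schema keyed by condition kind, are hypothetical, and the MLLM is replaced by a trivial text-folding stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Condition:
    """One user-supplied condition (hypothetical type).

    `kind` names the modality, e.g. 'text', 'image', 'camera_pose'.
    `payload` is a textual placeholder; a real system would carry
    tensors or file paths instead.
    """
    kind: str
    payload: str


def interpret_conditions(conditions: List[Condition]) -> Dict[str, str]:
    """Stand-in for the MLLM step: fold all conditions into one structured caption.

    Assumption: the caption is a dict keyed by condition kind. A real
    interpreter would be an instruction-tuned multimodal model.
    """
    caption: Dict[str, str] = {}
    for cond in conditions:
        if cond.kind in caption:
            # Several conditions of the same kind are joined, not overwritten.
            caption[cond.kind] += "; " + cond.payload
        else:
            caption[cond.kind] = cond.payload
    return caption


def render_caption(caption: Dict[str, str]) -> str:
    """Flatten the structured caption into a text prompt with a stable key order."""
    return " | ".join(f"{key}: {value}" for key, value in sorted(caption.items()))


def generate_video(prompt: str, backbone: Callable[[str], str]) -> str:
    """Synthesis step: the backbone sees only text, never the raw conditions."""
    return backbone(prompt)


# Usage: the backbone here is a dummy callable standing in for a video generator.
conditions = [
    Condition("text", "a dog runs on a beach"),
    Condition("camera_pose", "slow pan left"),
]
prompt = render_caption(interpret_conditions(conditions))
video = generate_video(prompt, backbone=lambda p: f"<video for: {p}>")
```

Because the generator's interface is plain text in this sketch, the backbone can be swapped without touching the interpretation stage, which is the practical appeal of the decoupling.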