CV | AI | Dec 12, 2024

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

arXiv:2412.09283v1 | 17 citations | h-index: 18 | CVPR
Originality: Highly original
AI Analysis

This work addresses the challenge of generating high-fidelity videos from text, with applications in media and AI. It proposes a new captioning method for a known bottleneck: the quality of the video-caption pairs used for training.

The paper tackles the problem of insufficient detail and hallucinations in video captions for text-to-video generation by proposing InstanceCap, an instance-aware structured caption framework, which significantly outperforms previous models in fidelity and reduces hallucinations.

Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, which affects the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design a cluster of auxiliary models that converts the original video into instances, enhancing instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.
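
To make the idea of an instance-aware structured caption more concrete, here is a minimal sketch of what such a record might look like. The abstract does not specify the actual schema, so every class and field name below (InstanceCaption, StructuredCaption, global_summary, appearance, motion, to_prompt) is a hypothetical illustration, not the paper's format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstanceCaption:
    # Hypothetical per-instance fields; the abstract only states that
    # captions are instance-level and fine-grained.
    name: str
    appearance: str
    motion: str


@dataclass
class StructuredCaption:
    # Hypothetical container: one global summary plus per-instance entries.
    global_summary: str
    instances: List[InstanceCaption] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Flatten the structure into short phrases, one per instance."""
        parts = [self.global_summary.strip()]
        for inst in self.instances:
            parts.append(f"{inst.name}: {inst.appearance}; {inst.motion}")
        return " | ".join(parts)


caption = StructuredCaption(
    global_summary="A dog plays in a park",
    instances=[
        InstanceCaption("dog", "brown, short fur", "runs left to right"),
        InstanceCaption("ball", "red, small", "bounces twice"),
    ],
)
prompt = caption.to_prompt()
print(prompt)
```

Keeping each instance as a separate record, rather than one free-form paragraph, is one plausible way to keep descriptions concise while tying every attribute and motion to a specific object.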
