CV
Feb 28, 2025

Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos

arXiv:2502.21314v2
3 citations
h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses dataset quality and computational resource issues in text-to-video generation for AI researchers and practitioners, representing an incremental improvement through integrated data curation and model design.

The paper tackles limitations in text-to-video generation by introducing a high-quality dataset (CFC-VIDS-1M) and a transformer-based model (RACCOON) trained with a progressive four-stage strategy. The authors report visually appealing and temporally coherent videos at a manageable computational cost.

Text-to-video generation has demonstrated promising progress with the advent of diffusion models, yet existing approaches are limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that advances both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, followed by a fine-grained stage that leverages vision-language models to enhance text-video alignment and semantic richness. Building upon the curated dataset's emphasis on visual quality and temporal coherence, we develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to efficiently handle the complexities of video generation. Extensive experiments demonstrate that our integrated approach of high-quality data curation and efficient training strategy generates visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.
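The coarse-to-fine idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Clip` type, the score dimensions, the thresholds, and `alignment_fn` are all hypothetical, and the fine stage is modeled here as a filter on a text-video alignment score, whereas the paper's fine-grained stage may instead rewrite or enrich captions with a vision-language model. The point it shows is only the ordering: cheap multi-dimensional quality checks run first, so the expensive model-based step sees just the survivors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Clip:
    clip_id: str
    caption: str
    # Hypothetical cheap per-dimension quality scores in [0, 1].
    coarse_scores: Dict[str, float]


def coarse_filter(clips: List[Clip], thresholds: Dict[str, float]) -> List[Clip]:
    """Coarse stage: keep clips whose score meets the threshold on every dimension.

    A missing dimension is treated as a score of 0.0, so the clip is dropped.
    """
    return [
        clip
        for clip in clips
        if all(clip.coarse_scores.get(dim, 0.0) >= t for dim, t in thresholds.items())
    ]


def fine_filter(
    clips: List[Clip],
    alignment_fn: Callable[[Clip], float],
    min_alignment: float,
) -> List[Clip]:
    """Fine stage: keep clips whose (expensive) alignment score is high enough.

    alignment_fn is a stand-in for a vision-language model scorer (assumption).
    """
    return [clip for clip in clips if alignment_fn(clip) >= min_alignment]


def curate(
    clips: List[Clip],
    thresholds: Dict[str, float],
    alignment_fn: Callable[[Clip], float],
    min_alignment: float,
) -> List[Clip]:
    """Run the coarse stage first so the fine stage only scores survivors."""
    survivors = coarse_filter(clips, thresholds)
    return fine_filter(survivors, alignment_fn, min_alignment)
```

Calling `curate(clips, {"aesthetic": 0.5, "motion": 0.5}, alignment_fn, 0.5)` would return only the clips that pass both stages; the threshold values are placeholders, not figures from the paper.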

