CV · AI · LG · Nov 25, 2024

Towards Precise Scaling Laws for Video Diffusion Transformers

arXiv:2411.17470v2 · 14 citations · h-index: 20 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the high training costs of video diffusion models by providing a method to optimize performance and efficiency, though it is incremental as it extends scaling laws from language to video models.

The paper tackles the problem of optimizing video diffusion transformers by deriving precise scaling laws to predict optimal model size and hyperparameters, achieving a 40.1% reduction in inference costs while maintaining comparable performance under a compute budget of 1e10 TFlops.

Achieving optimal performance of video diffusion transformers within a given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealing under practical inference cost constraints, achieving a better trade-off.
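
As a rough illustration of what deriving a scaling law involves, the sketch below fits a generic power law y = a * x^b by linear regression in log-log space and extrapolates it to a larger compute budget. It is not the paper's functional form or fitting procedure. The function name `fit_power_law`, the data points, and the coefficients are all hypothetical and synthetic, chosen only so the fit can be checked.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares in log-log space; return (a, b)."""
    # log y = log a + b * log x, so a degree-1 polynomial fit recovers b and log a.
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, made-up data (NOT from the paper): compute budgets and the
# model size that was best at each budget in a hypothetical sweep.
compute = np.array([1e7, 1e8, 1e9, 1e10])
true_a, true_b = 2.0, 0.5  # assumed coefficients used to generate the data
optimal_params = true_a * compute ** true_b

a, b = fit_power_law(compute, optimal_params)

# Extrapolate the fitted law to a budget outside the sweep.
predicted = a * (1e11) ** b
print(f"a={a:.3f}, b={b:.3f}, predicted optimal size at 1e11: {predicted:.3e}")
```

In practice the measured points would be noisy, and the same log-log fit could be applied to other quantities the abstract mentions, such as optimal learning rate or batch size as a function of model size and compute, under whatever functional form the authors actually use.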

