LG | Sep 23, 2025

Video Killed the Energy Budget: Characterizing the Latency and Power Regimes of Open Text-to-Video Models

arXiv:2509.19222v1 | 2 citations | h-index: 11 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses energy efficiency for developers and researchers deploying text-to-video models. Its contribution is incremental: it characterizes existing models rather than proposing a new method.

The paper tackles the high computational cost and poorly understood energy demands of text-to-video generation by systematically measuring the latency and energy consumption of state-of-the-art open-source models. It reports quadratic growth with spatial and temporal dimensions and linear scaling with the number of denoising steps, and it provides a benchmark reference and practical insights for designing more sustainable generative video systems.

Recent advances in text-to-video (T2V) generation have enabled the creation of high-fidelity, temporally coherent clips from natural language prompts. Yet these systems come with significant computational costs, and their energy demands remain poorly understood. In this paper, we present a systematic study of the latency and energy consumption of state-of-the-art open-source T2V models. We first develop a compute-bound analytical model that predicts scaling laws with respect to spatial resolution, temporal length, and denoising steps. We then validate these predictions through fine-grained experiments on WAN2.1-T2V, showing quadratic growth with spatial and temporal dimensions, and linear scaling with the number of denoising steps. Finally, we extend our analysis to six diverse T2V models, comparing their runtime and energy profiles under default settings. Our results provide both a benchmark reference and practical insights for designing and deploying more sustainable generative video systems.
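The scaling behaviour described in the abstract can be illustrated with a small sketch. This is not the paper's analytical model: it assumes that self-attention over the latent tokens dominates compute, so cost grows with the square of the token count and linearly with the number of denoising steps. The function name `relative_cost` and the stride values (`spatial_stride`, `temporal_stride`) are hypothetical placeholders chosen for illustration, not parameters taken from WAN2.1-T2V.

```python
def relative_cost(height, width, frames, steps,
                  spatial_stride=16, temporal_stride=4):
    """Relative compute of one generation, in arbitrary units.

    Assumption (not from the paper): attention dominates, so cost is
    proportional to steps * tokens**2. The strides are hypothetical
    downsampling factors from pixels/frames to latent tokens.
    """
    tokens = ((height // spatial_stride)
              * (width // spatial_stride)
              * max(1, frames // temporal_stride))
    return steps * tokens ** 2


# Baseline configuration (values chosen only for illustration).
base = relative_cost(480, 832, 80, 50)

# Ratios relative to the baseline when one factor is doubled.
print("frames x2:", relative_cost(480, 832, 160, 50) / base)
print("steps  x2:", relative_cost(480, 832, 80, 100) / base)
print("H, W   x2:", relative_cost(960, 1664, 80, 50) / base)
```

Under these assumptions, doubling the number of frames quadruples the cost, while doubling the number of denoising steps only doubles it. Doubling both height and width quadruples the token count, so the cost rises sixteenfold. Real measurements will deviate wherever memory bandwidth or non-attention layers matter.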

