How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
The paper addresses the growing environmental sustainability problem facing AI developers and policymakers by providing a standardized benchmarking tool, though its contribution is incremental: it applies existing methods to new data.
This paper tackles the problem of quantifying the environmental impact of LLM inference by introducing a benchmarking framework that measures energy, water, and carbon footprints across 30 models. It finds that the most energy-intensive models exceed 29 Wh per long prompt, and that even a 0.42 Wh short query, scaled to 700M queries/day, adds up to annual electricity use comparable to that of 35,000 U.S. homes.
This paper introduces an infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models in commercial datacenters. The framework combines public API performance data with company-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost and provide a dynamically updated dashboard that visualizes model-level energy, water, and carbon metrics. Results show the most energy-intensive models exceed 29 Wh per long prompt, over 65 times the most efficient systems. Even a 0.42 Wh short query, when scaled to 700M queries/day, aggregates to annual electricity comparable to 35,000 U.S. homes, evaporative freshwater equal to the annual drinking needs of 1.2M people, and carbon emissions requiring a Chicago-sized forest to offset. These findings highlight a growing paradox: as AI becomes cheaper and faster, global adoption drives disproportionate resource consumption. Our methodology offers a standardized, empirically grounded basis for sustainability benchmarking and accountability in AI deployment.
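The scaling step in the abstract can be illustrated with a minimal sketch. It assumes plain linear scaling of the stated 0.42 Wh per query to 700M queries/day over 365 days, and it applies per-kWh water and carbon multipliers in the spirit of the "environmental multipliers" mentioned above. The function names and the multiplier values are hypothetical placeholders, not the paper's figures. The sketch does not attempt to reproduce the household, drinking-water, or forest equivalences, which depend on conversion factors and assumptions not given in the abstract.

```python
def annual_energy_mwh(wh_per_query: float, queries_per_day: float, days: int = 365) -> float:
    """Linearly scale a per-query energy figure (Wh) to annual energy (MWh)."""
    return wh_per_query * queries_per_day * days / 1e6  # 1 MWh = 1e6 Wh


def footprint(energy_mwh: float, water_l_per_kwh: float, carbon_kg_per_kwh: float) -> dict:
    """Apply per-kWh multipliers to an energy total.

    Both multipliers are supplied by the caller; this sketch has no built-in values.
    """
    kwh = energy_mwh * 1e3  # 1 MWh = 1e3 kWh
    return {
        "water_l": kwh * water_l_per_kwh,
        "carbon_kg": kwh * carbon_kg_per_kwh,
    }


# Figures stated in the abstract.
short_query_wh = 0.42
queries_per_day = 700e6

energy = annual_energy_mwh(short_query_wh, queries_per_day)

# Hypothetical placeholder multipliers, chosen only to make the example run.
impact = footprint(energy, water_l_per_kwh=1.0, carbon_kg_per_kwh=0.4)

# "Exceed 29 Wh ... over 65 times the most efficient systems" implies the
# most efficient systems use less than roughly 29 / 65 Wh per long prompt.
efficient_upper_bound_wh = 29 / 65

print(f"Annual energy: {energy:,.0f} MWh")
print(f"Most efficient systems: under about {efficient_upper_bound_wh:.2f} Wh per long prompt")
```

Under these assumptions the annual total is about 107,000 MWh. Any water or carbon figure from `footprint` is only as meaningful as the multipliers passed in, so the placeholder values should be replaced with provider-specific ones before drawing conclusions.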