AI AR DC · Aug 1, 2024

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse

arXiv:2408.00741v132.9141 citations · h-index: 15

Originality: Incremental advance

AI Analysis

The work addresses energy efficiency and cost for operators of LLM inference clusters, and represents an incremental improvement in system optimization.

The paper tackles the high energy consumption and carbon emissions of LLM inference clusters. It proposes DynamoLLM, an energy-management framework that dynamically reconfigures the cluster to optimize energy and cost while meeting performance SLOs. At the service level, the authors report 53% energy savings, a 38% reduction in operational carbon emissions, and a 61% reduction in cost to the customer.

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume a large amount of energy and, consequently, to produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and the fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space in which different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize the energy and cost of LLM serving under the service's performance SLOs. We show that, at the service level, DynamoLLM saves 53% of energy and 38% of operational carbon emissions, and reduces the cost to the customer by 61%, while meeting the latency SLOs.
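The search space the abstract describes (number of instances, model parallelism, GPU frequency) can be illustrated with a minimal sketch that enumerates configurations and keeps the lowest-energy one that meets a latency SLO. This is not DynamoLLM's algorithm, which the abstract does not specify. The `Config` fields, `pick_config`, and the `toy_estimate` cost model with all of its constants are hypothetical and chosen only to make the trade-off visible.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    """One point in the (hypothetical) configuration search space."""
    instances: int           # number of model instances
    tensor_parallelism: int  # GPUs per instance
    gpu_freq_mhz: int        # GPU clock frequency


def toy_estimate(cfg: Config) -> Tuple[float, float]:
    """Made-up cost model returning (latency in s, energy in J).

    All constants are illustrative assumptions, not measurements.
    """
    freq_ghz = cfg.gpu_freq_mhz / 1000
    gpus = cfg.instances * cfg.tensor_parallelism
    latency_s = (
        4.0 / (cfg.tensor_parallelism * freq_ghz)  # compute time
        + 0.05 * cfg.tensor_parallelism            # communication overhead
        + 0.4 / cfg.instances                      # queueing delay
    )
    power_w = gpus * (60 + 100 * freq_ghz ** 3)    # static + dynamic power
    return latency_s, power_w * latency_s


def pick_config(
    configs: Iterable[Config],
    estimate: Callable[[Config], Tuple[float, float]],
    slo_latency_s: float,
) -> Optional[Config]:
    """Return the lowest-energy config that meets the SLO, or None."""
    best, best_energy = None, float("inf")
    for cfg in configs:
        latency_s, energy_j = estimate(cfg)
        if latency_s <= slo_latency_s and energy_j < best_energy:
            best, best_energy = cfg, energy_j
    return best


# Exhaustive grid over the three knobs named in the abstract.
configs = [
    Config(i, t, f)
    for i, t, f in product((1, 2, 4), (1, 2, 4), (800, 1200, 1600))
]

if __name__ == "__main__":
    print(pick_config(configs, toy_estimate, slo_latency_s=2.0))
```

Exhaustive enumeration works only because this grid has 27 points. A production system would presumably need profiled cost models and a cheaper search, and would also have to account for the cost of reconfiguring a running cluster.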

