DC · AI · Mar 12

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

arXiv:2604.09611 · 33.91 citations · h-index: 2

Predicted impact: top 48% in DC · last 90 days · Originality: Incremental advance

AI Analysis

The paper addresses energy and performance concerns for developers and system operators deploying LLMs in multi-request applications, but its contribution is incremental because it builds on existing serving systems.

This paper tackles the high latency and energy demand of multi-request LLM inference workflows. It finds that batch size is the most impactful factor in the performance-energy trade-off, though its benefits vary by workload. It also finds that engine-level optimizations in vLLM improve GPU utilization, while Parrot's workflow-aware scheduling reduces energy consumption under strict power constraints.

Large language models (LLMs) are increasingly used in applications forming multi-request workflows like document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored. To address these gaps, this paper presents the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. We develop four representative workloads capturing sequential, interactive, agentic, and composite patterns common in modern deployments. Using an NVIDIA A100 testbed with state-of-the-art serving systems (vLLM and Parrot), we analyze how key energy knobs affect latency, throughput, and component-level energy use. Our findings reveal batch size as the most impactful lever, though benefits are workload dependent. While optimal batching benefits workloads with large shared prompts, it is ineffective for sequential summarization and only partially effective for multi-agent coding. GPU power capping provides modest but predictable savings, while output length induces linear energy scaling with limited efficiency gains. We further show that engine-level optimizations in vLLM maintain higher GPU utilization and efficiency, especially for decode-heavy workloads, while Parrot's workflow-aware scheduling achieves lower energy consumption under strict power constraints. These findings offer actionable guidelines for developers and system operators designing performance- and energy-aware LLM serving systems in emerging multi-request workflows.
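To make the notion of an energy metric concrete, here is a minimal Python sketch of how energy per output token could be derived from sampled power readings. It is not the paper's measurement method, which the abstract does not describe. The function names `energy_joules` and `energy_per_token` are hypothetical, and the numbers in the usage lines are made up purely for illustration.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average the two neighbouring samples and multiply by the interval length.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_per_token(energy_j: float, output_tokens: int) -> float:
    """Energy cost per generated token, in joules per token."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return energy_j / output_tokens


# Illustrative values only (not measurements from the paper).
run_energy = energy_joules([0.0, 1.0, 2.0], [100.0, 200.0, 100.0])
print(run_energy)                         # 300.0
print(energy_per_token(run_energy, 150))  # 2.0
```

If energy grows linearly with output length, as the abstract reports, this per-token figure stays roughly flat as responses get longer, which is one way to read "limited efficiency gains". Comparing the same metric across batch sizes or power caps would show where a knob actually improves efficiency.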

