DC, AI, CL, NI | Oct 17, 2025

BeLLMan: Controlling LLM Congestion

arXiv:2510.15330v1 | h-index: 10
Originality: Incremental advance
AI Analysis

The work addresses latency and energy issues in LLM infrastructure that affect both users and operators, but it is incremental because it builds on existing control mechanisms for system load.

The paper tackles the problem that LLM applications ignore system load during token generation, which causes high latency and poor user experience. It introduces beLLMan, a controller that adjusts output length based on load. During congestion on a summarization workload, this yields up to 8x lower end-to-end latency and a 25% energy reduction while serving 19% more requests.

Large language model (LLM) applications are blindfolded to the infrastructure underneath and generate tokens autoregressively, indifferent to the system load, thus risking inferencing latency inflation and poor user experience. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust the output length in response to changing system load. On a real testbed with H100 GPUs, beLLMan helps keep inferencing latency under control (up to 8x lower end-to-end latency) and reduces energy consumption by 25% (while serving 19% more requests) during periods of congestion for a summarization workload.
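
The abstract does not spell out the control law, so the following is only a minimal sketch of what a load-aware output-length controller could look like. It swaps in a generic additive-increase/multiplicative-decrease (AIMD) rule, which is not confirmed to be beLLMan's method. The class name `LengthController`, the helper `length_hint`, the latency target, and all token limits are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LengthController:
    """Hypothetical AIMD-style output-length controller (not the paper's algorithm)."""

    target_latency_s: float        # assumed end-to-end latency objective
    max_tokens: int = 512          # assumed output budget when uncongested
    min_tokens: int = 64           # assumed floor so answers stay useful
    decrease_factor: float = 0.5   # multiplicative cut under congestion
    increase_step: int = 32        # additive recovery when load eases

    def __post_init__(self) -> None:
        # Start from the full budget; congestion signals shrink it later.
        self.budget = self.max_tokens

    def update(self, observed_latency_s: float) -> int:
        """Adjust the token budget from one latency observation and return it."""
        if observed_latency_s > self.target_latency_s:
            # Congested: shorten outputs quickly, but never below the floor.
            self.budget = max(self.min_tokens, int(self.budget * self.decrease_factor))
        else:
            # Healthy: restore output length gradually, up to the ceiling.
            self.budget = min(self.max_tokens, self.budget + self.increase_step)
        return self.budget


def length_hint(budget: int) -> str:
    """Render the budget as an instruction the application could add to its prompt."""
    return f"Summarize the text in at most {budget} tokens."


if __name__ == "__main__":
    controller = LengthController(target_latency_s=2.0)
    for latency in [1.0, 5.0, 5.0, 1.5, 1.5]:
        budget = controller.update(latency)
        print(f"latency={latency:.1f}s -> {length_hint(budget)}")
```

In this sketch the infrastructure side would call `update` with a measured latency, and the first-party application would apply the resulting hint to subsequent requests. Any other load signal, such as queue depth or GPU utilization, could replace latency without changing the structure.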

