DC AI CL NI | Oct 17, 2025

BeLLMan: Controlling LLM Congestion

Tella Rajashekhar Reddy, Atharva Deshmukh, Karan Tandon, Rohan Gandhi, Anjaly Parayil, Debopam Bhattacherjee

arXiv:2510.15330v1 | 1.2 | h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses latency and energy issues in LLM infrastructure that affect both users and operators, but it is incremental because it builds on existing mechanisms for controlling system load.

The paper tackles the problem of LLM applications ignoring system load during token generation, which causes high latency and poor user experience. It introduces beLLMan, a controller that adjusts output length based on load. During congestion on a summarization workload, this yields up to 8x lower end-to-end latency and a 25% energy reduction while serving 19% more requests.

Large language model (LLM) applications are blindfolded to the infrastructure underneath and generate tokens autoregressively, indifferent to the system load, thus risking inferencing latency inflation and poor user experience. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust the output length in response to changing system load. On a real testbed with H100 GPUs, beLLMan helps keep inferencing latency under control (up to 8x lower end-to-end latency) and reduces energy consumption by 25% (while serving 19% more requests) during periods of congestion for a summarization workload.
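
The idea of progressively signaling an application to shorten its output under load can be illustrated with a small sketch. This is not the paper's algorithm: the class `LengthController`, the helper `length_hint`, the utilization thresholds, the step size, and the token budget are all hypothetical values chosen for illustration, and the abstract does not specify how load is measured or how the signal is delivered.

```python
from dataclasses import dataclass


@dataclass
class LengthController:
    """Hypothetical load-aware output-length controller (not the paper's method)."""
    base_tokens: int = 512      # assumed output budget when the system is idle
    min_fraction: float = 0.25  # assumed floor on how far outputs may shrink
    step: float = 0.25          # assumed change in the fraction per control tick
    high_load: float = 0.8      # assumed utilization above which we tighten
    low_load: float = 0.5       # assumed utilization below which we relax
    fraction: float = 1.0       # current fraction of base_tokens allowed

    def update(self, load: float) -> int:
        """Take one load sample in [0, 1] and return the new token budget."""
        if load > self.high_load:
            # Congested: shrink the budget one step, but never below the floor.
            self.fraction = max(self.min_fraction, self.fraction - self.step)
        elif load < self.low_load:
            # Load has eased: restore the budget one step, up to the full length.
            self.fraction = min(1.0, self.fraction + self.step)
        # Between the two thresholds the budget is held steady.
        return int(self.base_tokens * self.fraction)


def length_hint(budget: int) -> str:
    """Render the budget as an instruction the application could add to its prompt."""
    return f"Summarize the text in at most {budget} tokens."


controller = LengthController()
budgets = [controller.update(load) for load in [0.3, 0.9, 0.95, 0.99, 0.99, 0.6, 0.2]]
print(budgets)  # [512, 384, 256, 128, 128, 128, 256]
```

Using two thresholds with a hold band between them is a design choice of this sketch: it keeps the budget from oscillating when load hovers near a single cutoff, and the stepwise change mirrors the "progressive" signaling the abstract describes.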

