DC AI · Jul 23, 2025

BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving

Wanyi Zheng, Minxian Xu, Shengye Song, Kejiang Ye

arXiv:2507.17120v1 · 2.31 citations · h-index: 9 · 2025 IEEE International Conferences on Internet of Things (iThings), IEEE Green Computing & Communications (GreenCom), IEEE Cyber, Physical & Social Computing (CPSCom), IEEE Smart Data (SmartData), and IEEE Congress on Cybermatics (Cybermatics)

Originality: Incremental advance

AI Analysis

This work addresses performance bottlenecks in LLM serving systems for applications that require efficient, low-latency inference. It is an incremental improvement over existing batching methods.

The paper tackles inefficient GPU memory utilization and high latency in LLM inference serving under heterogeneous workloads by introducing BucketServe, a bucket-based dynamic batching framework that groups requests by sequence length to minimize padding. Results show up to 3.58x higher throughput than UELLM and 1.93x more request load than DistServe at 80% SLO attainment.

Large language models (LLMs) have become increasingly popular in various areas, with traditional businesses gradually shifting from rule-based systems to LLM-based solutions. However, LLM inference is resource-intensive and latency-sensitive, posing significant challenges for serving systems. Existing LLM serving systems often use static or continuous batching strategies, which can lead to inefficient GPU memory utilization and increased latency, especially under heterogeneous workloads. These methods may also struggle to adapt to dynamic workload fluctuations, resulting in suboptimal throughput and potential service level objective (SLO) violations. In this paper, we introduce BucketServe, a bucket-based dynamic batching framework designed to optimize LLM inference performance. By grouping requests into size-homogeneous buckets based on sequence length, BucketServe minimizes padding overhead and optimizes GPU memory usage through real-time batch size adjustments that prevent out-of-memory (OOM) errors. It introduces adaptive bucket splitting/merging and priority-aware scheduling to mitigate resource fragmentation and ensure SLO compliance. Experiments show that BucketServe significantly outperforms UELLM in throughput, achieving up to a 3.58x improvement. It can also handle 1.93x more request load than DistServe at an SLO attainment of 80%, and it demonstrates 1.975x higher system load capacity than UELLM.
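
The core idea in the abstract, grouping requests by sequence length so that each batch pads to a similar length, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the fixed bucket boundaries, the helper names (`bucket_index`, `group_into_buckets`, `padding_tokens`, `max_batch_size`), and the token-budget stand-in for GPU memory are all assumptions made for illustration. The paper's adaptive splitting/merging and priority-aware scheduling are not modeled here.

```python
from collections import defaultdict


def bucket_index(length, boundaries):
    """Return the index of the first bucket whose upper bound fits `length`.

    `boundaries` is a sorted list of hypothetical upper bounds (in tokens).
    Lengths above the last bound fall into one extra overflow bucket.
    """
    for i, upper in enumerate(boundaries):
        if length <= upper:
            return i
    return len(boundaries)


def group_into_buckets(lengths, boundaries):
    """Group request lengths into size-homogeneous buckets."""
    buckets = defaultdict(list)
    for n in lengths:
        buckets[bucket_index(n, boundaries)].append(n)
    return dict(buckets)


def padding_tokens(batch):
    """Tokens wasted when every sequence is padded to the longest one."""
    return max(batch) * len(batch) - sum(batch)


def max_batch_size(token_budget, padded_len):
    """Largest batch that fits a token budget (an assumed memory proxy)."""
    return max(1, token_budget // padded_len)


# Hypothetical request lengths (in tokens) from a mixed workload.
lengths = [12, 500, 30, 480, 25, 510]

# Batching everything together pads short requests up to 510 tokens.
mixed_waste = padding_tokens(lengths)

# Bucketing by length keeps short and long requests in separate batches.
buckets = group_into_buckets(lengths, boundaries=[64, 512])
bucketed_waste = sum(padding_tokens(b) for b in buckets.values())

print(mixed_waste, bucketed_waste)
```

With these made-up lengths, a single mixed batch spends far more tokens on padding than the two length-homogeneous buckets do, which is the effect the abstract attributes to bucketing. The `max_batch_size` helper hints at how a batch size could be capped against a memory budget to avoid OOM errors; a real system would need an actual memory model rather than a flat token count.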

