99.3 DC May 6
eLLM: Elastic Memory Management Framework for Efficient LLM Serving
Jiale Xu, Rui Zhang, Yi Xiong et al.
Large Language Models are increasingly being deployed in datacenters. Serving these models requires careful memory management, as their memory usage includes static weights, dynamic activations, and key-value (KV) caches. While static weights are constant and predictable, dynamic components such as activations and KV caches change frequently at runtime, presenting significant challenges for efficient memory management. Modern LLM serving systems typically handle runtime memory and KV caches at distinct abstraction levels: runtime memory management relies on static tensor abstractions, whereas KV caches use a page-table-based virtualization layer built on top of the tensor abstraction. This virtualization dynamically manages KV caches to mitigate memory fragmentation. However, this dual-level approach fundamentally isolates runtime memory management from KV cache management, resulting in suboptimal memory utilization under dynamic workloads, which can lead to a nearly 20% drop in throughput. To address these limitations, we propose eLLM, an elastic memory management framework inspired by the classical memory ballooning mechanism in operating systems. The core components of eLLM are: (1) a Virtual Tensor Abstraction, which decouples the virtual address space of tensors from physical GPU memory, creating a unified and flexible memory pool; (2) an Elastic Memory Mechanism, which dynamically adjusts memory allocation through runtime memory inflation and deflation, leveraging CPU memory as an extensible buffer; and (3) a Lightweight Scheduling Strategy, which employs SLO-aware policies to optimize memory utilization and balance performance trade-offs under stringent SLO constraints. Comprehensive evaluations demonstrate that eLLM significantly outperforms state-of-the-art systems, achieving 2.32x higher decoding throughput and supporting 3x larger batch sizes for 128K-token inputs.
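The idea of a single pool shared between activations and the KV cache can be illustrated with a toy page counter. This is only a sketch under stated assumptions, not eLLM's implementation: the class `ElasticPool` and the methods `lend_to_kv`, `reclaim_from_kv`, and `alloc_kv` are hypothetical names, pages are plain integers, and the CPU-memory buffer and SLO-aware scheduling are left out. How these two operations map onto the abstract's "inflation" and "deflation" is not specified here.

```python
class ElasticPool:
    """Toy unified pool: pages are either reserved for activations or usable by the KV cache."""

    def __init__(self, total_pages, activation_pages):
        if not 0 <= activation_pages <= total_pages:
            raise ValueError("activation_pages must lie in [0, total_pages]")
        self.total = total_pages
        self.activation = activation_pages  # pages currently reserved for activations
        self.kv_used = 0                    # pages currently holding KV cache entries

    @property
    def kv_capacity(self):
        # Everything not reserved for activations is available to the KV cache.
        return self.total - self.activation

    def lend_to_kv(self, pages):
        """Shrink the activation reservation; returns the number of pages handed over."""
        moved = min(pages, self.activation)
        self.activation -= moved
        return moved

    def reclaim_from_kv(self, pages):
        """Grow the activation reservation using only KV pages that are currently free."""
        free = self.kv_capacity - self.kv_used
        moved = min(pages, free)
        self.activation += moved
        return moved

    def alloc_kv(self, pages):
        """Allocate KV pages if they fit; returns True on success."""
        if self.kv_used + pages > self.kv_capacity:
            return False
        self.kv_used += pages
        return True
```

With a static split, a KV allocation fails as soon as the KV share is full even when activation pages sit idle; in this sketch, calling `lend_to_kv` first lets the same allocation succeed.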
95.0 DC May 12
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
Di Liu, Ruitian Wang, Chen Chen et al.
As large language models scale to longer contexts, loading the growing KV cache during attention computation becomes a critical bottleneck. Previous work has shown that attention computation is dominated by a small subset of tokens. This motivates block sparse attention methods that partition the KV cache into fixed-size blocks and selectively compute attention over those blocks exhibiting high importance. However, these methods assign a uniform block size across all attention heads, implicitly assuming homogeneous behavior throughout the model. Our analysis reveals that this assumption is flawed: attention heads exhibit widely varying sensitivity to block granularity, and uniformity leads to suboptimal accuracy. We present AB-Sparse, a training-free algorithm-system co-designed framework that improves accuracy while preserving throughput. AB-Sparse introduces lightweight adaptive block size allocation across attention heads to improve accuracy. To compensate for the additional memory overhead, it further employs lossless block centroid quantization. In addition, custom GPU kernels are developed to support efficient execution with variable block sizes. Evaluation results demonstrate that AB-Sparse achieves an accuracy improvement of up to 5.43% over existing block sparse attention baselines without throughput overhead.
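Block sparse attention with a configurable block size can be sketched for a single query vector as follows. This is a minimal illustration, not AB-Sparse itself: the function name `block_sparse_attention`, the use of the mean key as the block centroid, and top-k selection by query-centroid dot product are assumptions made for the example, and the quantization and custom GPU kernels from the abstract are not modeled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend only to the top_k KV blocks whose mean key scores highest against q."""
    n, d = len(keys), len(q)
    # Partition token indices into contiguous blocks; the last block may be shorter.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def centroid(block):
        return [sum(keys[i][j] for i in block) / len(block) for j in range(d)]

    # Rank blocks by query-centroid score and keep the best top_k, in token order.
    scores = [dot(q, centroid(b)) for b in blocks]
    ranked = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)
    chosen = sorted(ranked[:top_k])
    idx = [i for b in chosen for i in blocks[b]]

    # Standard softmax attention restricted to the selected tokens.
    logits = [dot(q, keys[i]) / math.sqrt(d) for i in idx]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    out = [sum(w[t] * values[i][j] for t, i in enumerate(idx)) / z
           for j in range(len(values[0]))]
    return out, idx
```

A per-head block size then amounts to calling the function with a different `block_size` for each head while holding the token budget `block_size * top_k` fixed, so that heads that need finer granularity get smaller blocks at the same attention cost.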
SP Jul 31, 2020
Secrecy Outage Probability Analysis for RIS-Assisted NOMA Systems
Liang Yang, Yongjie Yuan
In this paper, the physical layer security (PLS) of a novel reconfigurable intelligent surface (RIS)-assisted non-orthogonal multiple access (NOMA) system in a multi-user scenario is investigated, where we consider the worst case in which the eavesdropper also benefits from the RISs. More specifically, we derive analytical results for the secrecy outage probability (SOP). From the numerical results, we observe that the use of RISs can improve the secrecy performance compared to traditional NOMA systems. However, in the worst case, where the signals received at the eavesdropper come from both the RISs and the source, increasing the number of intelligent elements on the RIS has a negative impact on the secrecy performance. At high SNRs, the system's SOP tends to a constant. Finally, the secrecy performance can be improved through group selection.
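For reference, the secrecy outage probability is conventionally defined as follows. This is the standard textbook form rather than the paper's own derivation, and the symbols are assumptions introduced here: $\gamma_D$ and $\gamma_E$ denote the instantaneous SNRs at the legitimate user and at the eavesdropper, and $R_s$ is the target secrecy rate.

```latex
C_s = \bigl[\log_2(1+\gamma_D) - \log_2(1+\gamma_E)\bigr]^{+},
\qquad
\mathrm{SOP} = \Pr\left(C_s < R_s\right),
\qquad
[x]^{+} = \max(x, 0).
```

Under this definition, anything that raises $\gamma_E$ along with $\gamma_D$ shrinks the gap between the two logarithms, which is consistent with the observation that extra RIS elements can hurt secrecy when the eavesdropper also receives the reflected signal.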