LG · AI · May 22, 2025

Runtime Adaptive Pruning for LLM Inference

arXiv:2505.17138v4 · 4 citations · h-index: 8
Originality: Highly original
AI Analysis

This work addresses LLM deployment challenges by enabling adaptive compression that handles runtime memory variations and heterogeneous KV-cache demands, a novel approach rather than an incremental improvement.

The paper tackles the problem of high computational and memory requirements in LLM inference by proposing RAP, a runtime adaptive pruning framework that dynamically adjusts compression strategies using reinforcement learning, achieving superior performance over state-of-the-art baselines.

Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential way to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or to heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP tracks the evolving ratio between model parameters and KV-cache during practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines and is the first approach to jointly consider model weights and KV-cache on the fly.
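
The trade-off described in the abstract can be illustrated with a small sketch. This is not RAP's method: the paper uses an RL agent, whereas the code below swaps in a plain greedy utility-per-byte heuristic purely to show how one memory budget is shared between weights and KV-cache. The `Block` type, the `utility` scores, and all sizes are hypothetical; only the per-token KV-cache formula (one key and one value vector per KV head) is standard.

```python
from dataclasses import dataclass


def kv_bytes_per_token(n_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """KV-cache bytes one attention layer stores per cached token (keys + values)."""
    return 2 * n_kv_heads * head_dim * dtype_bytes


@dataclass(frozen=True)
class Block:
    """A prunable component; every field value used below is a made-up illustration."""
    name: str
    param_bytes: int          # memory held by the block's weights
    kv_bytes_per_token: int   # KV-cache growth per cached token (0 for FFNs)
    utility: float            # hypothetical importance score, not from the paper


def block_cost(block: Block, cached_tokens: int) -> int:
    """Total memory a block needs at the current context length."""
    return block.param_bytes + block.kv_bytes_per_token * cached_tokens


def select_blocks(blocks, budget_bytes: int, cached_tokens: int):
    """Greedy stand-in for the RL agent: keep blocks with the best utility per byte."""
    ranked = sorted(
        blocks,
        key=lambda b: b.utility / max(block_cost(b, cached_tokens), 1),
        reverse=True,
    )
    kept, used = [], 0
    for b in ranked:
        cost = block_cost(b, cached_tokens)
        if used + cost <= budget_bytes:  # skip any block that would exceed the budget
            kept.append(b.name)
            used += cost
    return kept, used


# Toy model in abstract memory units: FFNs are weight-heavy, attention is KV-heavy.
BLOCKS = [
    Block("attn0", param_bytes=4, kv_bytes_per_token=2, utility=1.0),
    Block("ffn0", param_bytes=12, kv_bytes_per_token=0, utility=1.2),
    Block("attn1", param_bytes=4, kv_bytes_per_token=2, utility=0.5),
    Block("ffn1", param_bytes=12, kv_bytes_per_token=0, utility=0.9),
]

short_ctx = select_blocks(BLOCKS, budget_bytes=30, cached_tokens=1)
long_ctx = select_blocks(BLOCKS, budget_bytes=30, cached_tokens=8)
```

With a short context the attention blocks are cheap and tend to be kept; as the number of cached tokens grows, their KV-cache cost rises and the same budget favors the FFN blocks instead. This shift in the parameter-to-KV-cache ratio is what a fixed pruning heuristic cannot follow.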