ARMar 25

HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD

arXiv:2602.1875082.2h-index: 4

Predicted impact top 6% in AR · last 90 daysOriginality Highly original

AI Analysis

This enables efficient, low-latency, and privacy-preserving long-context LLM inference on memory-constrained AI Personal Computers, addressing a critical deployment challenge.

The paper tackles the memory bottleneck of long-context LLM inference on edge devices by proposing HillInfer, a framework that offloads lightweight token importance evaluation to a Computational Storage Drive, achieving up to an 8.56× speedup over state-of-the-art baselines without accuracy loss.

Deploying Large Language Models (LLMs) on memory-constrained AI Personal Computers (AIPCs) enables low-latency, privacy-preserving inference, but long-context generation is fundamentally bottlenecked by the linearly growing Key-Value (KV) cache. While dynamic KV eviction mitigates this memory wall, existing offloading strategies either trigger crippling PCIe I/O bottlenecks on standard SSDs or suffer from FPGA resource exhaustion by forcing compute-intensive exact attention on a single, weak Computational Storage Drive (CSD). In this paper, we propose HillInfer, a CSD-assisted KV eviction framework that introduces a paradigm shift: offloading strictly lightweight token importance evaluation to a single CSD (e.g., SmartSSD) on AIPCs. To fully capitalize on this lightweight offloading strategy, HillInfer orchestrates a Hierarchical KV Cache Manager (HKM) that leverages temporal locality and dynamic token hit rates to physically partition cache pools, thereby eliminating cross-device I/O thrashing. Additionally, we design an Adaptive Prefetch-based Pipeline (APP) that adaptively balances the evaluation workload between the host CPU and the SmartSSD, effectively masking the heterogeneous straggler effect. Finally, we introduce a CSD-based Evaluation Configuration (CEC) to enable resource-efficient near-data processing on the FPGA. Extensive experiments on a commodity AIPC demonstrate that HillInfer achieves up to an 8.56$\times$ speedup over state-of-the-art baselines, delivering low-latency, I/O-efficient long-context inference without sacrificing model accuracy.

View on arXiv PDF

Similar