LG | AI | DC | Nov 8, 2025

MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference

arXiv:2511.06010v1 | 2 citations | h-index: 5 | IEEE Computer Architecture Letters
Originality: Highly original
AI Analysis

MoSKA addresses a critical efficiency problem for LLM inference in applications where many requests share the same context, and it offers a scalable architectural solution.

The paper tackles the performance bottleneck of KV cache memory in long-sequence LLM inference by introducing MoSKA, which exploits context data heterogeneity to batch requests over shared sequences, achieving a throughput increase of up to 538.7x in high-sharing workloads.

The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
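The GEMV-to-GEMM transformation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, one decode-step query per request, and a shared K/V block that every request attends to in full. The function names and sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_per_request(queries, K, V):
    """Baseline: each request re-reads the shared K/V (GEMV-style)."""
    d = K.shape[1]
    outs = []
    for q in queries:                      # q has shape (d,)
        scores = K @ q / np.sqrt(d)        # GEMV: (n, d) @ (d,) -> (n,)
        outs.append(softmax(scores) @ V)   # (n,) @ (n, d) -> (d,)
    return np.stack(outs)

def attention_batched(queries, K, V):
    """Batched: stack the queries so shared K/V feed a single GEMM."""
    d = K.shape[1]
    scores = queries @ K.T / np.sqrt(d)    # GEMM: (b, d) @ (d, n) -> (b, n)
    return softmax(scores, axis=-1) @ V    # (b, n) @ (n, d) -> (b, d)

# Hypothetical sizes: 1024 shared tokens, head dim 64, 32 concurrent requests.
rng = np.random.default_rng(0)
n_shared, d_head, n_requests = 1024, 64, 32
K = rng.standard_normal((n_shared, d_head))
V = rng.standard_normal((n_shared, d_head))
Q = rng.standard_normal((n_requests, d_head))

# Both paths compute the same attention output; only the access pattern differs.
assert np.allclose(attention_per_request(Q, K, V), attention_batched(Q, K, V))
```

In the per-request loop, each pass over K and V performs only one multiply-add per element loaded, which is why the operation is memory-bound. In the batched form, each element of K and V is reused across all 32 queries, so arithmetic intensity grows roughly with the number of concurrent requests sharing the context.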

