LG | AI | CL | Apr 16, 2025

MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models

arXiv:2504.12526v1 | 1 citation | h-index: 19
Originality: Highly original
AI Analysis

The paper addresses deployment challenges for long-context language models by removing prefill memory as the dominant inference bottleneck, which could shift research priorities toward decode-stage KV cache efficiency.

The paper tackles the high GPU memory demands of long-context language model inference by proposing MOM, a method that partitions critical layers into smaller mini-sequences and integrates with KV cache offloading. Across Llama, Qwen, and Mistral models, MOM reduces peak memory usage by over 50% on average. On Meta-Llama-3.2-8B, it extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU while keeping outputs identical and accuracy unchanged.

Long-context language models exhibit impressive performance but remain challenging to deploy due to high GPU memory demands during inference. We propose Memory-efficient Offloaded Mini-sequence Inference (MOM), a method that partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading. Experiments on various Llama, Qwen, and Mistral models demonstrate that MOM reduces peak memory usage by over 50% on average. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU, while keeping outputs identical and not compromising accuracy. MOM also maintains highly competitive throughput due to minimal computational overhead and efficient last-layer processing. Compared to traditional chunked prefill methods, MOM achieves a 35% greater context length extension. More importantly, our method drastically reduces prefill memory consumption, eliminating it as the longstanding dominant memory bottleneck during inference. This breakthrough fundamentally changes research priorities, redirecting future efforts from prefill-stage optimizations to improving decode-stage residual KV cache efficiency.
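The mini-sequence idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the "critical layer" is a position-wise MLP whose expanded intermediate activation dominates memory, and it ignores attention and KV cache offloading entirely. All names here (`up_project`, `down_project`, `full_forward`, `mini_sequence_forward`, `num_mini`) are hypothetical. Because each token is processed independently by such a layer, splitting the sequence into mini-sequences shrinks the peak intermediate size while leaving the outputs unchanged.

```python
import math
import random

def up_project(rows, w_up):
    # Expand each token vector to the (larger) hidden width, with ReLU.
    return [[max(0.0, sum(x * w for x, w in zip(row, col))) for col in w_up]
            for row in rows]

def down_project(hidden, w_down):
    # Project each hidden vector back to the model width.
    return [[sum(h * w for h, w in zip(row, col)) for col in w_down]
            for row in hidden]

def full_forward(rows, w_up, w_down):
    # Baseline: the intermediate activation covers the whole sequence at once.
    hidden = up_project(rows, w_up)
    peak = len(hidden) * len(w_up)  # number of intermediate values held
    return down_project(hidden, w_down), peak

def mini_sequence_forward(rows, w_up, w_down, num_mini):
    # Assumed mini-sequence variant: process the sequence in num_mini slices,
    # so only one slice's intermediate activation is alive at a time.
    size = math.ceil(len(rows) / num_mini)
    out, peak = [], 0
    for start in range(0, len(rows), size):
        hidden = up_project(rows[start:start + size], w_up)
        peak = max(peak, len(hidden) * len(w_up))
        out.extend(down_project(hidden, w_down))
    return out, peak

# Toy sizes (illustrative only): 8 tokens, model width 2, hidden width 4.
random.seed(0)
seq_len, d_model, d_ff = 8, 2, 4
tokens = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]
w_up = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]

out_full, peak_full = full_forward(tokens, w_up, w_down)
out_mini, peak_mini = mini_sequence_forward(tokens, w_up, w_down, num_mini=4)
print(out_full == out_mini, peak_full, peak_mini)
```

In this sketch the peak intermediate size falls in proportion to the number of mini-sequences, at the cost of running the layer in several smaller passes. The memory savings reported in the abstract come from the full method on real models, not from this toy.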

