CL · Mar 27

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

arXiv:2603.26557 · 52.8 · h-index: 18
AI Analysis

The paper addresses cost reduction for LLM inference in interactive settings, but the contribution is incremental: it builds on retrieval-augmented generation and adds features such as memory growth and routing.

The paper tackles the high inference cost of LLMs in real-world services by proposing MemBoost, a memory-boosted framework in which a lightweight model reuses previously generated answers and retrieves supporting information. This reduces expensive large-model invocations while keeping answer quality comparable to a strong-model baseline.

Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.
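
The control flow described in the abstract (reuse a stored answer, otherwise let the small model answer with retrieved context, otherwise escalate to the strong model and grow the memory) can be illustrated with a toy sketch. This is not the paper's implementation: the class name `MemoryRouter`, the three thresholds, the string-similarity measure (a stand-in for whatever retrieval the authors use), and the assumption that the small model returns a confidence score are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for embedding retrieval."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


@dataclass
class MemoryRouter:
    # Hypothetical interfaces: the small model returns (answer, confidence),
    # the large model returns an answer only.
    small_model: Callable[[str, Optional[str]], Tuple[str, float]]
    large_model: Callable[[str], str]
    reuse_threshold: float = 0.95      # near-duplicate query: reuse stored answer
    retrieve_threshold: float = 0.5    # related query: pass stored answer as context
    confidence_threshold: float = 0.7  # below this, escalate to the large model
    memory: List[Tuple[str, str]] = field(default_factory=list)
    large_calls: int = 0

    def _best_match(self, query: str) -> Tuple[Optional[Tuple[str, str]], float]:
        """Return the most similar (query, answer) pair in memory and its score."""
        best, best_score = None, 0.0
        for stored_query, stored_answer in self.memory:
            score = similarity(query, stored_query)
            if score > best_score:
                best, best_score = (stored_query, stored_answer), score
        return best, best_score

    def answer(self, query: str) -> Tuple[str, str]:
        """Return (answer, route), where route is 'reuse', 'small', or 'large'."""
        match, score = self._best_match(query)

        # 1. Answer reuse: a near-duplicate query costs no model call at all.
        if match is not None and score >= self.reuse_threshold:
            return match[1], "reuse"

        # 2. Cheap inference: small model, optionally grounded in retrieved memory.
        context = match[1] if match is not None and score >= self.retrieve_threshold else None
        draft, confidence = self.small_model(query, context)
        if confidence >= self.confidence_threshold:
            self.memory.append((query, draft))  # continual memory growth
            return draft, "small"

        # 3. Escalation: uncertain queries go to the expensive model.
        self.large_calls += 1
        final = self.large_model(query)
        self.memory.append((query, final))      # continual memory growth
        return final, "large"


# Toy stand-ins for real models (assumptions for the demo only).
def toy_small_model(query: str, context: Optional[str]) -> Tuple[str, float]:
    # Confident only when memory supplied supporting context.
    if context is None:
        return "small-model guess", 0.2
    return f"small-model answer using: {context}", 0.9


def toy_large_model(query: str) -> str:
    return f"large-model answer to: {query}"


router = MemoryRouter(toy_small_model, toy_large_model)
for q in [
    "What is the capital of France?",
    "what is the capital of france?  ",
    "What is the capital of Germany?",
]:
    answer, route = router.answer(q)
    print(route, "|", answer)
print("large-model calls:", router.large_calls)
```

In this sketch the first query finds an empty memory and is escalated, the near-duplicate second query is served from memory, and the related third query is handled by the small model with the stored answer as context. A real system would need a principled confidence estimate and a retrieval index; how MemBoost implements either is not stated in the abstract.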
