LG · AI · Apr 24

MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

arXiv:2604.22881 · 28.0
Predicted impact: top 3% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners deploying generative recommendation models, MTServe reduces inference latency (up to 3.1x speedup) and eases GPU memory pressure by spilling cached user state to host RAM.

Generative recommendation models suffer from high inference costs due to repeated encoding of long user histories. MTServe introduces a hierarchical cache system that virtualizes GPU memory using host RAM, achieving up to 3.1x speedup with >98.5% cache hit ratio.

Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1x speedup while maintaining near-perfect hit ratios (>98.5%).
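
The tiering idea in the abstract can be illustrated with a minimal sketch. This is not MTServe's implementation: it assumes plain LRU as a stand-in for the paper's locality-driven replacement policy, uses Python dictionaries in place of GPU and host memory, and leaves out the hybrid storage layout and the asynchronous transfer pipeline. The names `TwoTierKVCache`, `compute_state`, and the capacity parameters are hypothetical.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy two-tier cache: a small 'GPU' tier backed by a larger 'host' tier.

    Assumption: LRU eviction in both tiers (a stand-in, not the paper's policy).
    """

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.gpu = OrderedDict()   # user_id -> cached state, most recent last
        self.host = OrderedDict()  # backup tier, same ordering convention
        self.hits = 0
        self.misses = 0

    def _put_host(self, user_id, state):
        self.host[user_id] = state
        self.host.move_to_end(user_id)
        if len(self.host) > self.host_capacity:
            self.host.popitem(last=False)  # drop the least recently used entry

    def _put_gpu(self, user_id, state):
        self.gpu[user_id] = state
        self.gpu.move_to_end(user_id)
        if len(self.gpu) > self.gpu_capacity:
            victim, victim_state = self.gpu.popitem(last=False)
            self._put_host(victim, victim_state)  # demote instead of discarding

    def get(self, user_id, compute_state):
        """Return the cached state, recomputing it only if both tiers miss."""
        if user_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(user_id)
            return self.gpu[user_id]
        if user_id in self.host:
            self.hits += 1
            state = self.host.pop(user_id)  # promote from host to GPU tier
        else:
            self.misses += 1
            state = compute_state(user_id)  # full re-encoding of the history
        self._put_gpu(user_id, state)
        return state

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = TwoTierKVCache(gpu_capacity=2, host_capacity=4)
for uid in ["a", "b", "c", "a", "b", "c"]:
    cache.get(uid, lambda u: "kv(" + u + ")")
print(cache.hit_ratio())
```

In this toy run the working set of three users does not fit in the two-slot GPU tier, yet every repeat request is still served from cache because evicted entries are demoted to the host tier rather than dropped. A GPU-only cache of the same size with LRU would recompute on every one of these requests.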

