LG · Apr 13

SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling

Zikun Liu, Liang Luo, Qianru Li, Zhengyu Zhang, Wei Ling, Jingyi Shen, Zeliang Chen, Yaning Huang, Jingxian Huang, Abdallah Aboelela, Chonglin Sun, Feifan Gu

arXiv:2604.12110 · 41.3 · h-index: 8

AI Analysis

For large-scale recommendation systems, SOLARIS allows deployment of computationally expensive foundation models in latency-critical online serving, improving business metrics without compromising inference speed.

SOLARIS enables real-time serving of large recommendation models by speculatively precomputing user-item embeddings, achieving a 0.67% gain in revenue-driving top-line metrics in Meta's advertising system.

Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation, which compromises serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta's advertising system, which serves billions of daily requests, SOLARIS achieves a 0.67% gain in revenue-driving top-line metrics, demonstrating its effectiveness at scale.
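
The decoupling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`foundation_model_embed`, `predict_future_pairs`, `SpeculativeEmbeddingCache`, `serve`) is hypothetical, the "foundation model" is a toy stand-in, and falling back to a default embedding on a cache miss is an assumption, since the abstract does not say how misses are handled.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]      # (user_id, item_id)
Embedding = List[float]


def foundation_model_embed(pair: Pair) -> Embedding:
    """Toy stand-in for the expensive foundation model (hypothetical)."""
    user_id, item_id = pair
    return [float(len(user_id)), float(len(item_id))]


def predict_future_pairs(user_id: str, recent_items: List[str]) -> List[Pair]:
    """Toy predictor (assumption): recently seen items are likely to reappear."""
    return [(user_id, item_id) for item_id in recent_items]


class SpeculativeEmbeddingCache:
    """Holds embeddings that were computed ahead of time, off the serving path."""

    def __init__(self, workers: int = 4) -> None:
        self._cache: Dict[Pair, Embedding] = {}
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def precompute_async(self, pairs: List[Pair]) -> List[Future]:
        """Submit background jobs that embed each predicted pair."""
        def work(pair: Pair) -> None:
            self._cache[pair] = foundation_model_embed(pair)

        return [self._pool.submit(work, pair) for pair in pairs]

    def lookup(self, pair: Pair) -> Optional[Embedding]:
        """Cheap read used by the serving path; never calls the big model."""
        return self._cache.get(pair)


def serve(cache: SpeculativeEmbeddingCache, user_id: str, item_id: str,
          default: Embedding) -> Embedding:
    """Serving path: use the precomputed embedding, or a default on a miss
    (the miss policy is an assumption of this sketch)."""
    embedding = cache.lookup((user_id, item_id))
    return embedding if embedding is not None else default
```

In this sketch the serving path only performs a dictionary lookup, so its latency does not depend on how slow `foundation_model_embed` is; the cost of a wrong prediction is a cache miss, not a slower request.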
