IR · LG · Feb 20, 2025

Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems

arXiv:2502.14305v2 · 3 citations · h-index: 27 · EMNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of high computational requirements for deploying LLMs in recommendation systems, offering incremental improvements through compression and optimization strategies.

The paper tackles the impractical computational demands of large language models (LLMs) in real-world deployments. It presents two techniques, knowledge distillation and model compression, for training and deploying small language models (SLMs) that retain much of the quality of larger models while reducing cost and latency. The results are applied to use cases on a large professional social network platform.

Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in Recommendation Systems.
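As a rough illustration of the first technique named in the abstract, the sketch below shows one common formulation of a knowledge distillation loss: a temperature-softened KL term between teacher and student outputs, blended with ordinary cross-entropy on the true label. This is a generic textbook form, not the paper's recipe; the function names, `temperature`, and `alpha` are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, stabilized by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target KL divergence and hard-label cross-entropy.

    temperature and alpha are illustrative defaults (assumptions),
    not values reported in the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # KL(teacher || student) over the temperature-softened distributions.
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_teacher, p_student) if t > 0)

    # Cross-entropy of the student against the true label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])

    # The temperature**2 factor is a common convention that keeps the
    # soft-term gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains. A higher `alpha` shifts the student toward imitating the teacher's full output distribution rather than the hard label alone.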

