CL | Apr 4, 2024

Towards Pareto Optimal Throughput in Small Language Model Serving

arXiv:2404.03353v3 | 13 citations | h-index: 4 | EuroMLSys@EuroSys
AI Analysis

This work targets resource-constrained users by optimizing SLM serving. Its contribution is incremental: it benchmarks experimentally the opportunities that SLMs already offer rather than introducing a new serving technique.

The paper benchmarks Small Language Model (SLM) inference for performance and energy efficiency. It finds that the small memory footprint of SLMs allows Pareto-optimal throughput to be reached on a single accelerator, and that model replication improves resource utilization.

Large language models (LLMs) have revolutionized the state-of-the-art of many different natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who now are able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference at performance and energy levels. Our analysis provides a new perspective in serving, highlighting that the small memory footprint of SLMs allows for reaching the Pareto-optimal throughput within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.
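To make the notion of Pareto-optimal throughput concrete, here is a minimal Python sketch that picks the non-dominated operating points from a set of (batch size, throughput, latency) measurements. The `Measurement` type, the `pareto_front` helper, and every number below are assumptions invented for illustration; they are not taken from the paper and do not reproduce its benchmarking method.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    batch_size: int
    throughput_tok_s: float  # higher is better
    latency_ms: float        # lower is better


def pareto_front(points: list[Measurement]) -> list[Measurement]:
    """Return the points that no other point dominates.

    q dominates p when q has throughput >= p and latency <= p,
    with at least one of the two comparisons strict.
    """
    front = []
    for p in points:
        dominated = any(
            q.throughput_tok_s >= p.throughput_tok_s
            and q.latency_ms <= p.latency_ms
            and (
                q.throughput_tok_s > p.throughput_tok_s
                or q.latency_ms < p.latency_ms
            )
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda m: m.batch_size)


# Invented numbers for illustration only; NOT measurements from the paper.
samples = [
    Measurement(1, 100.0, 10.0),
    Measurement(8, 700.0, 12.0),
    Measurement(64, 3000.0, 25.0),
    Measurement(256, 3100.0, 90.0),
    Measurement(512, 3050.0, 180.0),  # dominated by batch size 256
]

front = pareto_front(samples)
for m in front:
    print(m.batch_size, m.throughput_tok_s, m.latency_ms)
```

In this toy data, throughput stops improving past batch size 256 while latency keeps growing, so the largest batch falls off the frontier. Under the abstract's claim, an SLM would reach that plateau within the memory of a single accelerator, and any capacity left over could host additional model replicas.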

