DCAI | Jan 4, 2025

DeServe: Towards Affordable Offline LLM Inference via Decentralization

arXiv:2501.14784v1 | 6 citations | h-index: 8 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses affordable, scalable LLM inference for users and developers by reducing deployment costs. It is an incremental advance: it builds on existing serving systems and adds a decentralized approach.

The paper tackles the high cost and limited availability of GPUs for deploying open-source large language models (LLMs). It presents DeServe, a decentralized offline serving system that uses idle GPU resources and achieves a 6.7x-12.6x throughput improvement over existing serving system baselines in high-latency network environments.

The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in open-source LLMs have positioned them as strong contenders. However, deploying these models is often constrained by the high costs and limited availability of GPU resources. In response, this paper presents the design of a decentralized offline serving system for LLM inference. Utilizing idle GPU resources, our proposed system, DeServe, decentralizes access to LLMs at a lower cost. DeServe specifically addresses key challenges in optimizing serving throughput in high-latency network environments. Experiments demonstrate that DeServe achieves a 6.7x-12.6x improvement in throughput over existing serving system baselines in such conditions.

