CL · Oct 29, 2025

Serve Programs, Not Prompts

arXiv:2510.25412v1 · 4.91 citations · h-index: 5 · HotOS

Originality: Incremental advance

AI Analysis

The paper addresses inefficient and inflexible LLM serving for developers building complex LLM applications, and it frames its approach as a paradigm shift rather than an incremental improvement.

The paper tackles the inefficiency and inflexibility of current LLM serving systems by proposing an architecture that serves programs instead of prompts. Its example system, Symphony, virtualizes the KV cache and enables runtime customization, aiming for better efficiency and extensibility.

Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
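To make the idea of "serving a program" concrete, here is a minimal Python sketch of what an LLM Inference Program might look like. Everything in it is an assumption made for illustration: `ToyKernel`, `kv_open`, `kv_fork`, `pred`, the `/kv/...` paths, and the fixed toy token scores are hypothetical and are not Symphony's actual interface. It only illustrates three ideas from the abstract: model computation reached through a call into the server, KV cache addressed like files, and application logic (constrained decoding and a tool call) running server-side. It does not model the two-level scheduler.

```python
from typing import Callable, Dict, List


class ToyKernel:
    """Hypothetical stand-in for the serving system; NOT Symphony's real API."""

    def __init__(self) -> None:
        # KV cache "file system": path -> cached tokens (a real system stores tensors).
        self.kv_files: Dict[str, List[str]] = {}

    def kv_open(self, path: str) -> List[str]:
        """Open a KV cache file, creating it empty if it does not exist."""
        return self.kv_files.setdefault(path, [])

    def kv_fork(self, src: str, dst: str) -> List[str]:
        """Copy an existing KV cache file so a cached prefix can be reused."""
        self.kv_files[dst] = list(self.kv_files[src])
        return self.kv_files[dst]

    def pred(self, kv: List[str], tokens: List[str]) -> Dict[str, float]:
        """'System call': extend the KV cache and return next-token scores.

        The scores are fixed toy values; a real server would run the model here.
        """
        kv.extend(tokens)
        return {"maybe": 0.5, "yes": 0.3, "no": 0.2}


def yes_no_lip(kernel: ToyKernel, question: str, tool: Callable[[str], str]) -> str:
    """A hypothetical LIP: reuse a cached prompt, call a tool, constrain the output."""
    # Populate the shared system-prompt cache once; later requests reuse it.
    if "/kv/system" not in kernel.kv_files:
        kernel.pred(kernel.kv_open("/kv/system"), ["You", "answer", "yes", "or", "no."])

    # Per-request KV cache forked from the shared prefix (custom cache management).
    kv = kernel.kv_fork("/kv/system", "/kv/request")

    # The tool runs inside the program, on the server, with no client round trip.
    evidence = tool(question)
    scores = kernel.pred(kv, question.split() + evidence.split())

    # Custom token prediction: choose only among the allowed tokens.
    allowed = {"yes", "no"}
    return max(allowed, key=lambda token: scores[token])


kernel = ToyKernel()
answer = yes_no_lip(kernel, "Is 7 prime?", lambda q: "7 has no divisors")
```

In this sketch the unconstrained top token would be "maybe", but the program restricts the choice to "yes" or "no". That decision is made inside the served program rather than by the client, which is the kind of runtime customization the abstract describes.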
