VoxServe: Streaming-Centric Serving System for Speech Language Models

Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci

arXiv:2602.00269v1 | 3.84 citations | h-index: 27 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the need for low-latency, high-throughput streaming systems for SpeechLMs, though it appears incremental because it builds on existing serving optimizations.

The paper tackles the problem of deploying Speech Language Models (SpeechLMs) in streaming settings by presenting VoxServe, a serving system that achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability.

Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox-serve/vox-serve.
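
The abstract does not define "streaming viability" on this page. As a rough illustration only, the sketch below assumes one plausible reading: a stream is viable if every audio chunk is ready before the listener has finished playing the chunks that came before it. The function name `is_streamable`, its parameters, and the example numbers are hypothetical and are not taken from VoxServe.

```python
from typing import Sequence


def is_streamable(arrival_times: Sequence[float],
                  chunk_durations: Sequence[float]) -> bool:
    """Return True if playback never stalls once the first chunk arrives.

    arrival_times[i]   -- time in seconds when chunk i is ready (hypothetical input)
    chunk_durations[i] -- audio length in seconds of chunk i (hypothetical input)
    """
    if len(arrival_times) != len(chunk_durations):
        raise ValueError("need exactly one duration per chunk")
    if not arrival_times:
        return True  # an empty stream trivially never stalls

    # Assumption: playback starts the moment the first chunk arrives.
    playback_deadline = arrival_times[0]
    for arrival, duration in zip(arrival_times, chunk_durations):
        if arrival > playback_deadline:
            return False  # the buffer ran dry before this chunk was ready
        # Each delivered chunk extends the time until the buffer empties.
        playback_deadline += duration
    return True
```

Under this assumed definition, `is_streamable([0.0, 0.5, 1.0], [1.0, 1.0, 1.0])` is `True` because each chunk is ready before the buffered audio runs out, while `is_streamable([0.0, 1.5], [1.0, 1.0])` is `False` because the second chunk arrives half a second after the first one finishes playing.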
