SD · CL · AS · Jun 14, 2025

StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling

arXiv:2506.12570v1 · 7 citations · h-index: 12 · IEEE Signal Processing Letters
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient real-time TTS in applications such as speech large language models, though it appears incremental because it builds on existing zero-shot TTS methods.

The paper tackles the problem of real-time zero-shot text-to-speech synthesis by proposing StreamMel, a single-stage streaming framework that models continuous mel-spectrograms, achieving performance comparable to offline systems with low latency.

Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models. Audio samples are available at: https://aka.ms/StreamMel.
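The interleaving idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `interleave`, the chunk sizes `n_text` and `n_mel`, and the tagged-tuple output format are all hypothetical, and the abstract does not state the actual interleaving ratio or how the model handles a stream that runs out early. The sketch only shows how text tokens and mel frames might be merged into one sequence for an autoregressive model to consume left to right.

```python
from typing import Any, List, Sequence, Tuple


def interleave(
    text_tokens: Sequence[Any],
    mel_frames: Sequence[Any],
    n_text: int = 1,   # hypothetical: text tokens per chunk
    n_mel: int = 4,    # hypothetical: mel frames per chunk
) -> List[Tuple[str, Any]]:
    """Merge two streams into one sequence of ("text", x) / ("mel", y) items.

    Each round emits up to n_text text tokens followed by up to n_mel mel
    frames. When one stream is exhausted, the remainder of the other is
    emitted in order (an assumption of this sketch, not taken from the paper).
    """
    if n_text < 1 or n_mel < 1:
        raise ValueError("n_text and n_mel must be at least 1")

    merged: List[Tuple[str, Any]] = []
    t = m = 0
    while t < len(text_tokens) or m < len(mel_frames):
        # Emit the next chunk of text tokens, if any remain.
        for tok in text_tokens[t:t + n_text]:
            merged.append(("text", tok))
        t += n_text
        # Emit the next chunk of acoustic frames, if any remain.
        for frame in mel_frames[m:m + n_mel]:
            merged.append(("mel", frame))
        m += n_mel
    return merged
```

In this sketch, acoustic frames appear in the sequence after only the first few text tokens rather than after the whole sentence. That ordering is the property that would let a model start producing audio before the full text has been consumed.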

