VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
VoXtream addresses the need for efficient, real-time TTS in applications that require immediate audio feedback, although the contribution appears incremental because it builds on existing streaming TTS methods.
The authors target low-latency, real-time text-to-speech synthesis with VoXtream, a fully autoregressive streaming system that starts speaking from the first word. It reaches an initial delay of 102 ms on GPU while matching or surpassing larger baselines on several metrics.
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.
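To make the idea of a look-ahead that does not delay onset more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes text arrives word by word as lists of phonemes, that the pointer into the phoneme sequence only moves forward (monotonic alignment), and that each phoneme spans a fixed number of frames in place of the duration tokens predicted by the temporal transformer. The names `stream_frames`, `max_lookahead`, and `frames_per_phoneme` are hypothetical and chosen for illustration only.

```python
from typing import Iterable, Iterator, List, Tuple


def stream_frames(
    chunks: Iterable[List[str]],
    frames_per_phoneme: int = 2,   # assumption: fixed duration stands in for predicted duration tokens
    max_lookahead: int = 2,        # assumption: upper bound on look-ahead phonemes
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (current_phoneme, visible_context) once per output frame.

    The visible context is the current phoneme plus up to `max_lookahead`
    phonemes that have ALREADY arrived. The loop never waits for more text
    in order to fill the look-ahead window, so the first frame can be
    produced as soon as the first chunk is available.
    """
    buffer: List[str] = []
    pos = 0  # monotonic pointer: it only ever moves forward
    for chunk in chunks:           # a chunk is the phoneme list of a newly arrived word
        buffer.extend(chunk)
        while pos < len(buffer):
            # Dynamic look-ahead: use whatever is buffered, up to the cap.
            context = buffer[pos : pos + 1 + max_lookahead]
            for _ in range(frames_per_phoneme):
                yield buffer[pos], context
            pos += 1


if __name__ == "__main__":
    words = [["HH", "AH"], ["L", "OW"]]
    for phoneme, context in stream_frames(words, frames_per_phoneme=1):
        print(phoneme, context)
    # HH ['HH', 'AH']
    # AH ['AH']
    # L ['L', 'OW']
    # OW ['OW']
```

In this sketch the look-ahead window shrinks at the end of the buffered text instead of blocking, which is one simple way to read "dynamic". A real system would replace the fixed frame count with model-predicted durations and would emit audio tokens, not phoneme labels.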