AI · Mar 26

Voxtral TTS

Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti

DeepMind · Tsinghua

arXiv:2603.25551 · 81.2 · h-index: 53

Predicted impact: top 35% in AI · last 90 days · Originality: Synthesis-oriented

AI Analysis

This is an incremental improvement for multilingual voice cloning applications.

The paper tackles multilingual text-to-speech generation from as little as 3 seconds of reference audio, achieving a 68.4% win rate over ElevenLabs Flash v2.5 in human evaluations of multilingual voice cloning.

We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.
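
For readers unfamiliar with the FSQ half of the "VQ-FSQ" scheme named in the abstract, the following is a minimal sketch of generic finite scalar quantization. It is not the Voxtral Codec: the level counts in `LEVELS` and the function names `fsq_quantize` and `fsq_index` are hypothetical, chosen only for illustration, and the sketch does not attempt the VQ side or the way the paper combines the two.

```python
import math

# Hypothetical configuration: number of quantization levels per latent
# dimension. Odd counts keep the grid symmetric around zero. These values
# are illustrative and are NOT taken from the Voxtral Codec.
LEVELS = [5, 5, 3]


def fsq_quantize(z, levels=LEVELS):
    """Map a real-valued vector to integer grid codes, one per dimension."""
    if len(z) != len(levels):
        raise ValueError("z must have one value per quantized dimension")
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2          # e.g. 2 for 5 levels
        bounded = math.tanh(value) * half   # squash into (-half, half)
        codes.append(round(bounded) + half) # round, then shift to 0..n_levels-1
    return codes


def fsq_index(codes, levels=LEVELS):
    """Flatten per-dimension codes into a single token id (mixed radix)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + code
    return index


if __name__ == "__main__":
    codes = fsq_quantize([0.3, -1.2, 4.0])
    print(codes, fsq_index(codes))
```

Unlike a learned VQ codebook, the codebook here is implicit: it is the product of the per-dimension grids (5 × 5 × 3 = 75 entries in this toy setup), so quantization needs no nearest-neighbour search.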
