Voxtral Realtime
This work addresses low-latency speech transcription for applications that require real-time processing, and it reports a substantial gain rather than an incremental improvement.
The authors tackle real-time automatic speech recognition with Voxtral Realtime, a natively streaming model that matches offline transcription quality at sub-second latency: at a delay of 480ms, it performs on par with Whisper.
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
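To make the idea of explicit alignment between audio and text streams more concrete, the following is a minimal sketch of how a delayed text stream could be laid out on an audio frame grid. It is an illustration under stated assumptions, not the authors' implementation: the 80 ms frame duration, the `<pad>` symbol, the word-level end timestamps, and the function name `delayed_text_stream` are all hypothetical. Only the 480ms delay comes from the abstract.

```python
import math

PAD = "<pad>"  # hypothetical filler for frames that carry no text


def delayed_text_stream(words, n_frames, frame_ms=80, delay_ms=480):
    """Place each word on the audio frame grid, shifted by a fixed delay.

    words: list of (text, end_time_ms) pairs, sorted by end time.
    frame_ms: assumed duration of one audio frame (not from the source).
    Returns a list of length n_frames with one text slot per audio frame.
    """
    stream = [PAD] * n_frames
    delay_frames = math.ceil(delay_ms / frame_ms)
    next_free = 0  # earliest slot that is still unoccupied
    for text, end_ms in words:
        # Frame in which the word ends, pushed back by the delay.
        slot = end_ms // frame_ms + delay_frames
        # Keep word order and avoid overwriting an earlier word.
        slot = max(slot, next_free)
        if slot >= n_frames:
            break  # the word would land past the end of the clip
        stream[slot] = text
        next_free = slot + 1
    return stream


if __name__ == "__main__":
    example = [("hello", 300), ("world", 700)]
    print(delayed_text_stream(example, n_frames=20))
```

With these assumed values, a 480ms delay corresponds to 6 frames, so a word ending at 300 ms (frame 3) is emitted at frame 9. A larger delay gives the model more audio context before it must commit to each word, at the cost of latency, which is the trade-off the abstract describes.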