SD AI AS

Jul 17, 2025

Voxtral

Alexander H. Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Sanchit Gandhi, Soham Ghosh

Mistral AI

arXiv:2507.13264v1

30.726 citations

h-index: 27

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient, open-source multimodal models for audio and text comprehension, though it appears incremental because it extends existing chat model capabilities.

The authors address multimodal audio chat by introducing two models, Voxtral Mini and Voxtral Small. Both achieve state-of-the-art performance across diverse audio benchmarks while maintaining strong text capabilities. Voxtral Small outperforms a number of closed-source models while remaining small enough to run locally.

We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under the Apache 2.0 license.
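The two figures in the abstract (a 32K context window and audio of up to 40 minutes) imply a rough ceiling on how many tokens each second of audio can consume. The sketch below works through that arithmetic. It assumes that "32K" means 32 × 1024 tokens and that the entire window is spent on audio. The function names and the 12.5 tokens-per-second value in the usage line are illustrative placeholders, not figures stated on this page.

```python
import math

# Assumption: "32K" is read as 32 * 1024 tokens; the abstract does not spell this out.
CONTEXT_TOKENS = 32 * 1024
# The abstract's stated maximum audio duration, in seconds.
MAX_AUDIO_SECONDS = 40 * 60


def max_tokens_per_second(context_tokens=CONTEXT_TOKENS,
                          audio_seconds=MAX_AUDIO_SECONDS):
    """Upper bound on audio tokens per second if audio alone fills the window."""
    return context_tokens / audio_seconds


def audio_fits(duration_seconds, tokens_per_second,
               context_tokens=CONTEXT_TOKENS):
    """Check whether a clip at a given (hypothetical) token rate fits the window."""
    needed = math.ceil(duration_seconds * tokens_per_second)
    return needed <= context_tokens


if __name__ == "__main__":
    print(f"ceiling: {max_tokens_per_second():.2f} tokens per second of audio")
    # 12.5 is a placeholder rate chosen for illustration only.
    print("40 min at 12.5 tokens/s fits:", audio_fits(40 * 60, 12.5))
```

Under these assumptions the ceiling is about 13.65 tokens per second. In practice part of the window also holds text prompts and conversation turns, so the usable audio budget would be somewhat lower.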
