AIMar 26
Voxtral TTSAlexander H. Liu, Alexis Tacnet, Andy Ehrenberg et al. · deepmind, tsinghua
We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4\% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.
CLJan 13
Ministral 3Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian et al.
We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute and memory constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction finetuned, and a reasoning model for complex problem-solving. In addition, we present our recipe to derive the Ministral 3 models through Cascade Distillation, an iterative pruning and continued training with distillation technique. Each model comes with image understanding capabilities, all under the Apache 2.0 license.
LGMar 14, 2023
R2 Loss: Range Restriction Loss for Model Compression and QuantizationArnav Kundu, Chungkuk Yoo, Srijan Mishra et al.
Model quantization and compression is widely used techniques to reduce usage of computing resource at inference time. While state-of-the-art works have been achieved reasonable accuracy with higher bit such as 4bit or 8bit, but still it is challenging to quantize/compress a model further, e.g., 1bit or 2bit. To overcome the challenge, we focus on outliers in weights of a pre-trained model which disrupt effective lower bit quantization and compression. In this work, we propose Range Restriction Loss (R2-Loss) for building lower bit quantization and compression friendly models by removing outliers from weights during pre-training. By effectively restricting range of weights, we mold the overall distribution into a tight shape to ensure high quantization bit resolution, therefore allowing model compression and quantization techniques can to utilize their limited numeric representation powers better. We introduce three different, L-inf R2-Loss, its extension Margin R2-Loss and a new Soft-Min-MaxR2-Loss to be used as an auxiliary loss during full-precision model training. These R2-Loss can be used in different cases such as L-inf and Margin R2-Loss would be effective for symmetric quantization, while Soft-Min-Max R2-Loss shows better performance for model compression. In our experiment, R2-Loss improves lower bit quantization accuracy with state-of-the-art post-training quantization (PTQ), quantization-aware training (QAT), and model compression techniques. With R2-Loss, MobileNet-V2 2bit weight and 8bit activation PTQ, MobileNet-V1 2bit weight and activation QAT, ResNet18 1bit weight compression are improved to 59.49% from 50.66%, 59.05% from 55.96%, and 52.58% from 45.54%, respectively.
CLJun 12, 2025Code
MagistralMistral-AI, Abhinav Rastogi, Albert Q. Jiang et al.
We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.
SEAug 8, 2025Code
Devstral: Fine-tuning Language Models for Coding Agent ApplicationsAbhinav Rastogi, Adam Yang, Albert Q. Jiang et al. · deepmind
We introduce Devstral-Small, a lightweight open source model for code agents with the best performance among models below 100B size. In this technical report, we give an overview of how we design and develop a model and craft specializations in agentic software development. The resulting model, Devstral-Small is a small 24B model, fast and easy to serve. Despite its size, Devstral-Small still attains competitive performance compared to models more than an order of magnitude larger.
SDJul 17, 2025
VoxtralAlexander H. Liu, Andy Ehrenberg, Andy Lo et al. · deepmind
We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under Apache 2.0 license.
AIFeb 11
Voxtral RealtimeAlexander H. Liu, Andy Ehrenberg, Andy Lo et al.
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
LGDec 2, 2024
Cross Domain Adaptation using Adversarial networks with Cyclic lossManpreet Kaur, Ankur Tomar, Srijan Mishra et al.
Deep Learning methods are highly local and sensitive to the domain of data they are trained with. Even a slight deviation from the domain distribution affects prediction accuracy of deep networks significantly. In this work, we have investigated a set of techniques aimed at increasing accuracy of generator networks which perform translation from one domain to the other in an adversarial setting. In particular, we experimented with activations, the encoder-decoder network architectures, and introduced a Loss called cyclic loss to constrain the Generator network so that it learns effective source-target translation. This machine learning problem is motivated by myriad applications that can be derived from domain adaptation networks like generating labeled data from synthetic inputs in an unsupervised fashion, and using these translation network in conjunction with the original domain network to generalize deep learning networks across domains.