Magistral
This work provides a scalable RL pipeline for LLM training, which could benefit AI developers seeking more control over model training processes, though it appears incremental in its methodological approach.
The researchers tackled the challenge of training large language models with pure reinforcement learning, without relying on existing implementations or distilled data, and demonstrated that RL on text data alone maintains or improves capabilities such as multimodal understanding and instruction following in their Magistral models.
We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground-up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following, and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0), which further includes cold-start data from Magistral Medium.