MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
This work aims to improve the reasoning capabilities of language models. It is an incremental advance built on specific optimizations to the pre-training and post-training stages.
The authors optimize both the pre-training and post-training stages of a language model for reasoning tasks. The resulting base model, MiMo-7B-Base, outperforms much larger 32B models, and the RL-tuned model, MiMo-7B-RL, surpasses OpenAI o1-mini on mathematics, code, and general reasoning tasks.
We present MiMo-7B, a large language model designed for reasoning tasks and optimized across both the pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
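
The abstract names a test-difficulty-driven code reward but does not spell out its form. The sketch below shows one possible reading, not the authors' implementation: each test case carries a difficulty weight, and a solution earns the weighted share of the tests it passes, so a partially correct program still receives a nonzero signal. The function names, the weighting rule, and the example weights are all assumptions made for illustration.

```python
from typing import Sequence


def sparse_reward(passed: Sequence[bool]) -> float:
    """All-or-nothing baseline: 1.0 only if every test passes."""
    return 1.0 if len(passed) > 0 and all(passed) else 0.0


def difficulty_weighted_reward(
    passed: Sequence[bool],
    difficulty: Sequence[float],
) -> float:
    """Hypothetical dense reward in [0, 1].

    Returns the difficulty-weighted share of passed tests. The weights
    are assumed to be non-negative scores supplied by the caller; how
    they would be estimated is outside the scope of this sketch.
    """
    if len(passed) != len(difficulty):
        raise ValueError("passed and difficulty must have the same length")
    total = sum(difficulty)
    if total <= 0:
        return 0.0
    earned = sum(w for ok, w in zip(passed, difficulty) if ok)
    return earned / total


# A solution that passes the two easier tests but fails the hardest one.
passed = [True, True, False]
difficulty = [1.0, 2.0, 3.0]  # assumed weights, harder tests count more

dense = difficulty_weighted_reward(passed, difficulty)
sparse = sparse_reward(passed)
```

Under the all-or-nothing baseline this solution gets no reward at all, while the weighted variant still distinguishes it from a solution that fails every test, which is the kind of sparse-reward relief the abstract describes.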