Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes
The work provides a competitive open model and a practical training blueprint for practitioners aiming to scale reasoning capabilities under realistic compute constraints, though the contribution is incremental because it builds on existing methods such as SFT and RLFT.
The paper addresses model collapse and training instability when adapting language models for complex reasoning and long-context understanding. The result is Motif-2-12.7B-Reasoning, a 12.7B-parameter model that performs comparably to larger models on mathematics, coding, and agentic benchmarks.
We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.
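The abstract names difficulty-aware data filtering but gives no details here, so the following is only a minimal sketch of one common way such a filter could work, not the authors' implementation. It assumes each prompt has been sampled several times and scored with a verifiable reward, and it keeps only prompts whose pass rate falls inside a band. The function names (`pass_rate`, `filter_by_difficulty`) and the thresholds `min_rate` and `max_rate` are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def pass_rate(rewards: List[float]) -> float:
    """Fraction of sampled rollouts that earned a positive reward."""
    if not rewards:
        raise ValueError("need at least one rollout per prompt")
    return sum(1 for r in rewards if r > 0) / len(rewards)


def filter_by_difficulty(
    rollout_rewards: Dict[str, List[float]],
    min_rate: float = 0.1,  # hypothetical lower bound, not from the source
    max_rate: float = 0.9,  # hypothetical upper bound, not from the source
) -> List[str]:
    """Keep prompt ids whose pass rate lies in [min_rate, max_rate].

    Prompts that are always solved or never solved are dropped, since
    every rollout receives the same reward and the prompt carries
    little learning signal for the policy update.
    """
    kept = []
    for prompt_id, rewards in rollout_rewards.items():
        rate = pass_rate(rewards)
        if min_rate <= rate <= max_rate:
            kept.append(prompt_id)
    return kept


if __name__ == "__main__":
    # Toy rewards: 1 means the rollout was verified correct, 0 means it was not.
    sample = {
        "easy": [1, 1, 1, 1],
        "medium": [1, 0, 1, 0],
        "hard": [0, 0, 0, 0],
    }
    print(filter_by_difficulty(sample))
```

In this toy input only the "medium" prompt survives, because the other two have pass rates of 1.0 and 0.0. Whether the actual recipe uses pass-rate bands, and with what cutoffs, is not stated in the abstract.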