Compass-Thinker-7B Technical Report
This work addresses the resource-intensive nature of RL for large models, offering insights for more efficient training in AI reasoning, though it is incremental as it builds on existing RL methods.
The researchers tackled the high computational costs of reinforcement learning (RL) for large language models by proposing Compass-Thinker-7B, a model trained with a specialized RL pipeline on 30k mathematics problems, achieving 40% accuracy on the challenging AIME2024 evaluation.
Recent R1-Zero-like research further demonstrates that reasoning extension has given large language models (LLMs) unprecedented reasoning capabilities, and Reinforcement Learning is the core technology to elicit its complex reasoning. However, conducting RL experiments directly on hyperscale models involves high computational costs and resource demands, posing significant risks. We propose the Compass-Thinker-7B model, which aims to explore the potential of Reinforcement Learning with less computational resources and costs, and provides insights for further research into RL recipes for larger models. Compass-Thinker-7B is trained from an open source model through a specially designed Reinforcement Learning Pipeline. We curate a dataset of 30k verifiable mathematics problems for the Reinforcement Learning Pipeline. By configuring data and training settings with different difficulty distributions for different stages, the potential of the model is gradually released and the training efficiency is improved. Extensive evaluations show that Compass-Thinker-7B possesses exceptional reasoning potential, and achieves superior performance on mathematics compared to the same-sized RL model. Especially in the challenging AIME2024 evaluation, Compass-Thinker-7B achieves 40% accuracy.