LG | AI | Oct 21, 2025

Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

arXiv:2510.18814v1 | 3 citations | h-index: 1 | Has Code
Originality: Incremental advance
AI Analysis

Online SFT offers an efficient alternative to complex reward-based training for LLM reasoning, though the advance appears incremental because it builds on existing finetuning paradigms.

The paper introduces Online Supervised Finetuning (OSFT), a reward-free method for improving LLM reasoning in which the model generates its own training data and is immediately finetuned on it. On mathematical reasoning tasks, OSFT performs comparably to strong reinforcement learning methods such as GRPO.

We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experimental results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model's own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
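
The loop described in the abstract (one rollout, then an immediate supervised update on that rollout, with no reward) can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a single categorical distribution over three tokens rather than an LLM, and the names `softmax`, `osft_step`, and `run_osft`, the learning rate, and the step count are all hypothetical. Sampling at a temperature below 1 while training at temperature 1 is an assumption of this sketch, not something stated in the summary above; it is used because sampling and training at the same temperature would give a zero expected gradient in this toy setting.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def osft_step(logits, rng, lr=0.5, sample_temperature=0.5):
    """One toy online-SFT step: sample one response, then finetune on it.

    ASSUMPTION: sample_temperature < 1 is a choice made for this sketch.
    """
    # 1) Rollout: draw a single "response" (one token) from the current model.
    sample_probs = softmax(logits, sample_temperature)
    token = rng.choices(range(len(logits)), weights=sample_probs, k=1)[0]

    # 2) Supervised update: gradient descent on -log p(token) at temperature 1.
    #    For a softmax, d(-log p_token)/d logit_i = p_i - 1[i == token].
    probs = softmax(logits)
    return [
        x + lr * ((1.0 if i == token else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


def run_osft(logits, steps=200, seed=0):
    """Repeat the sample-then-finetune step; no reward signal is used."""
    rng = random.Random(seed)
    for _ in range(steps):
        logits = osft_step(logits, rng)
    return logits


if __name__ == "__main__":
    initial = [1.0, 0.5, 0.0]  # the toy model's "existing preference"
    final = run_osft(initial)
    print("before:", [round(p, 3) for p in softmax(initial)])
    print("after: ", [round(p, 3) for p in softmax(final)])
```

In this toy setting the distribution becomes more peaked over training, which is loosely analogous to the abstract's claim that OSFT reinforces preferences the model already has. It says nothing about whether reasoning accuracy improves in a real LLM.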

Code Implementations: 1 repo
