SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression
This work addresses computational inefficiency and error accumulation in the reasoning processes of LLMs, offering an incremental improvement over existing post-training methods.
The paper tackles inefficient reasoning in large language models by proposing SSPO, a method that optimizes each reasoning step without auxiliary models. This reduces computational overhead and improves accuracy, yielding reasoning sequences that are both accurate and succinct across diverse domains and languages.
Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically show that incorrect answers partially stem from verbose reasoning processes that lack correct self-fix, so that errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor step-wise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the reasoning sequences generated by SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
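The abstract does not specify the form of the step-wise objective. As a rough illustration only, the sketch below shows one generic way a preference signal could be applied per reasoning step: a DPO-style logistic loss averaged over steps. This is a swapped-in, well-known technique, not the SSPO objective itself. The function name `step_preference_loss`, its inputs, and the `beta` temperature are all hypothetical, and how SSPO derives the preferred and rejected steps from the model itself is not modeled here.

```python
import math

def step_preference_loss(preferred_logps, rejected_logps, beta=0.1):
    """Mean logistic (Bradley-Terry style) preference loss over reasoning steps.

    Hypothetical inputs: each list holds one value per reasoning step, e.g. the
    policy log-probability minus a reference log-probability for the preferred
    and the rejected version of that step. Not taken from the paper.
    """
    if len(preferred_logps) != len(rejected_logps) or not preferred_logps:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for lp_w, lp_l in zip(preferred_logps, rejected_logps):
        margin = beta * (lp_w - lp_l)
        # -log(sigmoid(margin)) equals log(1 + exp(-margin)); for very
        # negative margins fall back to the linear approximation to avoid overflow.
        if margin > -30.0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin
    return total / len(preferred_logps)


# Usage: two reasoning steps, each with a preferred and a rejected variant.
loss = step_preference_loss([-1.2, -0.8], [-2.5, -3.0])
print(f"step-wise preference loss: {loss:.4f}")
```

When the preferred and rejected steps score equally, the loss equals log 2; it falls as the preferred step gains margin, which is what makes the signal usable per step rather than only on the final answer.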