LG · AI · Feb 17, 2025

Thinking Preference Optimization

Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han

arXiv:2502.13173v115.76 citations · h-index: 9 · Has Code

Originality: Incremental advance

AI Analysis

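ThinkPO addresses two problems in fine-tuning for reasoning tasks: the cost of collecting new long CoT data and the performance plateau from repeated training on existing data. It offers a way to improve already fine-tuned models without collecting new long CoT responses.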

The paper tackles the problem of improving long chain-of-thought (CoT) reasoning in small LLMs without new high-quality data. It proposes Thinking Preference Optimization (ThinkPO), which uses short CoT responses as rejected answers and long CoT responses as chosen answers for the same question. The reported result is an 8.6% increase in math reasoning accuracy and 25.9% longer outputs for SFT-ed models.

Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them on long CoT responses from larger LLMs. To continue improving reasoning abilities, we can either collect new high-quality long CoT SFT data or repeatedly train on existing SFT datasets. However, acquiring new long CoT SFT data is costly and limited, while repeated training often results in a performance plateau or decline. To further boost performance with the existing SFT data, we propose Thinking Preference Optimization (ThinkPO), a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses. Instead, ThinkPO uses readily available or easily obtainable short CoT responses as rejected answers and long CoT responses as chosen answers for the same question. It then applies direct preference optimization (DPO) to encourage the model to favor longer reasoning outputs. Experiments show that ThinkPO further improves the reasoning performance of SFT-ed models, e.g., it increases their math reasoning accuracy by 8.6% and their output length by 25.9%. Notably, ThinkPO can continue to boost the performance of publicly distilled SFT models, e.g., raising the official DeepSeek-R1-Distill-Qwen-7B's performance on MATH500 from 87.4% to 91.2%.
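The pairing scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `PreferencePair`, `build_pairs`, and `dpo_loss` are hypothetical, the value `beta=0.1` is an arbitrary placeholder, and the loss is the standard per-example DPO objective computed from summed sequence log-probabilities that are assumed to be supplied by the caller.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # long CoT response (preferred)
    rejected: str  # short CoT response (dispreferred)


def build_pairs(long_cot, short_cot):
    """Pair long and short responses that answer the same question.

    Both arguments are dicts mapping a question to one response
    (a hypothetical data layout chosen for this illustration).
    Questions missing a short response are skipped.
    """
    pairs = []
    for question, long_answer in long_cot.items():
        short_answer = short_cot.get(question)
        if short_answer is None:
            continue
        pairs.append(PreferencePair(question, long_answer, short_answer))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    The margin is beta times the difference between the chosen and
    rejected log-probability ratios (policy versus reference model).
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    long_cot = {"What is 12 * 13?": "First, 12 * 10 = 120. Then 12 * 3 = 36. So 156."}
    short_cot = {"What is 12 * 13?": "156."}
    for pair in build_pairs(long_cot, short_cot):
        print(pair.prompt, "| chosen length:", len(pair.chosen),
              "| rejected length:", len(pair.rejected))
    # Loss when the policy still matches the reference model.
    print(dpo_loss(-40.0, -5.0, -40.0, -5.0))
```

When the policy matches the reference model, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the long response than the reference does, which is the pressure toward longer reasoning that the abstract describes.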
