AIMay 9

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Yang Li, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen

arXiv:2605.0890597.7

AI Analysis

For researchers and practitioners working on LLM reasoning and optimization, this work provides a benchmark and training method that goes beyond binary correctness to measure solution quality, with demonstrated transfer to diverse tasks.

The paper introduces OPT-BENCH, a framework for training and evaluating LLMs on NP-hard optimization problems using quality-aware reinforcement learning. Their model, trained on Qwen2.5-7B, achieves 93.1% success rate and 46.6% quality ratio, significantly outperforming GPT-4o (29.6% SR, 14.6% QR), and also improves performance on other tasks like math (+2.2%) and instruction following (+6.1%).

Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.

View on arXiv PDF

Similar