OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge

arXiv:2601.19924v11.63 citationsh-index: 6Has Code

Originality Incremental advance

AI Analysis

This work addresses the need for better evaluation of LLMs in optimization tasks, providing actionable insights for developers, but it is incremental as it builds on existing benchmarks and methods.

The authors tackled the problem of understanding the limits of LLMs in optimization modeling by proposing OPT-ENGINE, a benchmark framework with scalable difficulty, and found that tool-integrated reasoning with external solvers is more robust as complexity increases, while pure-text reasoning plateaus, with constraint formulation being the main bottleneck.

Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed-Integer Programming. Utilizing OPT-ENGINE, we conduct an extensive study of LLMs' reasoning capabilities, addressing two critical questions: 1.) Do LLMs' performance remain robust when generalizing to out-of-distribution optimization tasks that scale in complexity beyond current benchmark levels? and 2.) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool-integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next-generation LLMs for advanced optimization. Our code is publicly available at \textcolor{blue}{https://github.com/Cardinal-Operations/OPTEngine}.

View on arXiv PDF Code

Similar