Constructing Industrial-Scale Optimization Modeling Benchmark

Zhong Li, Hongliang Lu, Tao Wei, Wenyu Liu, Yuxuan Chen, Yuan Lan, Fan Zhang, Zaiwen Wen

arXiv:2602.10450v15.83 citations, h-index: 14

Originality: Incremental advance

AI Analysis

This work addresses the lack of realistic benchmarks for industrial optimization modeling, enabling more accurate evaluation of AI systems in logistics, manufacturing, energy, and finance.

The authors address the evaluation of large language models that translate natural-language requirements into optimization formulations. They introduce MIPLIB-NL, a benchmark derived from real industrial-scale mixed-integer linear programs, on which existing systems degrade substantially relative to their results on toy benchmarks.

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, which masks the difficulty of industrial problems with $10^{3}$ to $10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations and solver code grounded in real optimization models. To fill this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB 2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model-data separation format, and (iii) performs iterative semantic validation through expert review and human-LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.
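
The distinction between a flat solver formulation and a model-data separated one can be illustrated with a toy sketch. Everything below is hypothetical and is not taken from MIPLIB-NL or its pipeline: the knapsack model, the `KnapsackData` class, and the `flatten` and `brute_force` helpers are assumptions chosen only to show how a compact model plus a data record expands into the flat rows of coefficients that a solver file stores.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class KnapsackData:
    """Instance data, kept separate from the model structure (hypothetical)."""
    values: list[int]
    weights: list[int]
    capacity: int


def flatten(data: KnapsackData):
    """Expand the compact model 'max sum v_i x_i s.t. sum w_i x_i <= C'
    into a flat form: variable names, objective coefficients, and
    a list of (row coefficients, right-hand side) pairs."""
    names = [f"x{i}" for i in range(len(data.values))]
    objective = dict(zip(names, data.values))
    constraints = [(dict(zip(names, data.weights)), data.capacity)]
    return names, objective, constraints


def brute_force(names, objective, constraints):
    """Enumerate all binary assignments of the flat model (toy sizes only)."""
    best_value, best_assignment = None, None
    for bits in product((0, 1), repeat=len(names)):
        x = dict(zip(names, bits))
        feasible = all(
            sum(coef * x[var] for var, coef in row.items()) <= rhs
            for row, rhs in constraints
        )
        if not feasible:
            continue
        value = sum(objective[var] * x[var] for var in names)
        if best_value is None or value > best_value:
            best_value, best_assignment = value, x
    return best_value, best_assignment


# Toy instance: three items, capacity 6.
data = KnapsackData(values=[10, 13, 7], weights=[3, 4, 2], capacity=6)
names, objective, constraints = flatten(data)
best_value, best_assignment = brute_force(names, objective, constraints)
```

In this sketch the same `flatten` function serves any `KnapsackData` record, which is the sense in which the structure is reusable, while the flat output contains only anonymous coefficients. Recovering the former from the latter is the direction the abstract's step (i) describes, and it is far harder at $10^{3}$ to $10^{6}$ variables than in this three-variable toy.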
