AI LG Feb 16, 2025

OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling

Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, Zaiwen Wen

Peking U

arXiv:2502.11102v228.535 citations h-index: 3 Has Code ICML

Originality: Incremental advance

AI Analysis

OptMATH addresses data scarcity for researchers and practitioners in AI and optimization, enabling more robust modeling of practical problems from natural language descriptions. It is an incremental advance, since it builds on existing data synthesis methods.

The authors tackle the lack of high-quality datasets for training large language models to model optimization problems from natural language. They propose OptMATH, a scalable framework that synthesizes data with controllable complexity. Models trained on it, ranging from 0.5B to 32B parameters, achieve superior performance on multiple benchmarks.

Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. Then, a back-translation step is employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. Then a collection of rejected pairs is identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than those of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach. Our dataset is publicly available at https://github.com/AuroraLHL/OptMATH.
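
The bidirectional loop described in the abstract (PD to NL by back-translation, then NL back to a model by forward modeling, then rejection sampling) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `back_translate`, `forward_model_and_solve`, and `reference_value` are hypothetical placeholders for LLM and solver calls, and the acceptance test (matching objective values within a relative tolerance) is an assumption made for illustration, since the abstract does not state the exact criterion.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Pair:
    description: str    # natural language description (NL)
    problem_data: dict  # generated problem data (PD)


def split_by_rejection_sampling(
    items: List[dict],
    back_translate: Callable[[dict], str],
    forward_model_and_solve: Callable[[str], Optional[float]],
    reference_value: Callable[[dict], float],
    rel_tol: float = 1e-6,
) -> Tuple[List[Pair], List[Pair]]:
    """Split (NL, PD) pairs into accepted and rejected sets.

    Assumed criterion: a pair is accepted when the objective value
    recovered from the NL matches the reference value of the PD.
    """
    accepted: List[Pair] = []
    rejected: List[Pair] = []
    for pd in items:
        nl = back_translate(pd)                  # PD -> NL
        recovered = forward_model_and_solve(nl)  # NL -> model -> objective value
        expected = reference_value(pd)
        ok = recovered is not None and (
            abs(recovered - expected) <= rel_tol * max(1.0, abs(expected))
        )
        (accepted if ok else rejected).append(Pair(nl, pd))
    return accepted, rejected


# Toy stand-ins for the LLM and solver calls (purely hypothetical).
toy_data = [{"id": 0, "optimum": 10.0}, {"id": 1, "optimum": 7.5}]


def toy_back_translate(pd: dict) -> str:
    return f"Problem {pd['id']}: minimize total cost subject to capacity limits."


def toy_forward_solve(nl: str) -> Optional[float]:
    # Pretend problem 0 is modeled correctly and problem 1 is mis-modeled.
    return 10.0 if nl.startswith("Problem 0") else 99.0


def toy_reference(pd: dict) -> float:
    return pd["optimum"]


accepted, rejected = split_by_rejection_sampling(
    toy_data, toy_back_translate, toy_forward_solve, toy_reference
)
print(len(accepted), len(rejected))
```

In this sketch the accepted pairs play the role of training data, while the rejected pairs would still need the further filtering the abstract mentions before they could serve as benchmark instances.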
