CL | AI | Dec 22, 2024

Evaluating LLM Reasoning in the Operations Research Domain with ORQA

Mahdi Mostajabdaveh, Timothy T. Yu, Samarendra Chandan Bindu Dash, Rindranirina Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, Yong Zhang

arXiv:2412.17874v28.212 citations | h-index: 8 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of assessing LLM generalization for researchers and practitioners in operations research. It is incremental, however, in that it applies existing evaluation methods to a new domain-specific dataset.

The paper introduces ORQA, a benchmark that evaluates how well large language models (LLMs) generalize to operations research (OR) by testing their reasoning on complex optimization problems. Models such as LLaMA 3.1 show modest performance, indicating a gap in generalization to specialized domains.

In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark designed to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when confronted with diverse and complex optimization problems. The dataset, developed by OR experts, features real-world optimization problems that demand multistep reasoning to construct their mathematical models. Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, highlighting a gap in their ability to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs' generalization capabilities, offering valuable insights for future research in this area. The dataset and evaluation code are publicly available.
