LG AI CL | Oct 17, 2024

MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan

arXiv:2410.13502v3 | 14.210 citations | h-index: 40 | ICLR

Originality: Incremental advance

AI Analysis

This work addresses out-of-distribution evaluation of LLMs on mathematical reasoning and provides a framework for studying generalization. It is rated incremental because it builds on existing benchmarks and focuses on a specific domain.

The authors evaluate large language models on arithmetic word problems with arbitrarily complex proofs. They find that performance decreases significantly as proof depth and width increase, especially for nonlinear proof structures, and that the models are sensitive to changes in sentence ordering.

Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
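To make the idea of controlling proof depth concrete, here is a minimal Python sketch of a generator for linear problems, where each added comparison sentence adds one step to the proof. This is an illustration only and not the MathGAP implementation: the function name `make_linear_problem`, the name list, the "apples" template, and the value ranges are all assumptions chosen for the example.

```python
import random

# Hypothetical name pool; it caps the depth this toy generator supports.
NAMES = ["Alice", "Bob", "Carol", "Dana", "Eli", "Fay", "Gus", "Hana"]


def make_linear_problem(depth, seed=0):
    """Return (problem_text, answer) for a chain of `depth` comparison steps.

    Each step introduces a new person who has some number of apples more
    than the previous person, so solving the problem takes `depth` additions.
    """
    if not 1 <= depth < len(NAMES):
        raise ValueError("depth must be between 1 and len(NAMES) - 1")
    rng = random.Random(seed)  # seeded so the same spec gives the same problem
    total = rng.randint(1, 10)
    sentences = [f"{NAMES[0]} has {total} apples."]
    for i in range(1, depth + 1):
        diff = rng.randint(1, 10)
        total += diff  # running total is the quantity held by NAMES[i]
        sentences.append(
            f"{NAMES[i]} has {diff} more apples than {NAMES[i - 1]}."
        )
    sentences.append(f"How many apples does {NAMES[depth]} have?")
    return " ".join(sentences), total


if __name__ == "__main__":
    text, answer = make_linear_problem(depth=3, seed=1)
    print(text)
    print("answer:", answer)
```

Raising `depth` lengthens the chain of inferences without changing the sentence templates, which is the kind of controlled variation an easy-to-hard generalization study needs. Nonlinear structures, where intermediate results from several branches are combined, would need a tree-shaped specification rather than this single chain.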
