CL AI Jan 20, 2025

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

Zijin Hong, Hao Wu, Su Dong, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang, Xiao Huang

arXiv:2501.11790v49.65 citations h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses a need shared by researchers and developers for more reliable evaluation of LLMs' mathematical reasoning. It is incremental in that it builds on existing concerns about benchmark reliability.

The authors address unreliable math benchmarks for large language models by introducing RV-Bench, an evaluation method that uses random variable questions to test genuine reasoning. They find that LLMs show a proficiency imbalance between seen and unseen data and generalize only to a limited extent across similar tasks, though test-time scaling can help.

Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable benchmark that effectively evaluates large language models' (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we build question-generating functions to produce random variable questions (RVQs), whose background content mirrors original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's genuine reasoning capability is reflected through its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1,000 RVQs. Our findings suggest that LLMs exhibit a proficiency imbalance between encountered and "unseen" data distributions. Furthermore, RV-Bench reveals that proficiency generalization across similar mathematical reasoning tasks is limited, but we verified it can still be effectively elicited through test-time scaling.
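The idea of a question-generating function can be illustrated with a minimal sketch. The abstract does not show the paper's actual templates or scoring code, so everything below is an assumption made for illustration: the dice template, the names `generate_rvq` and `accuracy_on_instances`, and the use of plain accuracy over sampled instances as the score.

```python
import random
from fractions import Fraction


def generate_rvq(rng):
    """Hypothetical question-generating function (not from the paper).

    The background story stays fixed while the variables are re-sampled,
    so each call yields a fresh instance with a programmatically
    computed ground-truth answer.
    """
    sides = rng.choice([4, 6, 8, 10, 12, 20])
    threshold = rng.randint(1, sides)  # inclusive on both ends
    question = (
        f"A fair {sides}-sided die is rolled once. "
        f"What is the probability that the result is at least {threshold}?"
    )
    # Favourable outcomes are threshold, threshold + 1, ..., sides.
    answer = Fraction(sides - threshold + 1, sides)
    return question, answer


def accuracy_on_instances(solver, n_instances=5, seed=0):
    """Fraction of randomized instances that `solver` answers exactly.

    `solver` is any callable mapping a question string to a Fraction;
    in practice it would wrap a model call plus answer parsing.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_instances):
        question, answer = generate_rvq(rng)
        if solver(question) == answer:
            correct += 1
    return correct / n_instances
```

A solver that has only memorized the answer to one fixed variable combination would score poorly here, because its answer stops matching once the variables change. That is the sense in which randomized instances probe understanding of the underlying question pattern.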
