ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
ThinkBench addresses the problem of data contamination in LLM evaluation, which matters to researchers and practitioners alike, though the contribution is incremental because it builds on existing OOD methods.
The authors address the robust evaluation of large language models (LLMs) by introducing ThinkBench, a framework built on dynamic out-of-distribution (OOD) data generation. Their evaluation of 16 LLMs and 4 PRMs shows that most models are not robust and are affected by data leakage.
Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly evaluate the reasoning capability of LLMs. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset of 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that the performance of most LLMs is far from robust and that they are affected by some degree of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination.
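To make the notion of robustness concrete, the following is a minimal sketch of one simple way to compare a model's accuracy on original samples with its accuracy on their OOD counterparts. It is an illustration under stated assumptions, not ThinkBench's actual metric or code: the `PairedResult` type, the `robustness_report` function, and the idea of a paired original/OOD sample are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PairedResult:
    """Hypothetical record: one model's correctness on an original
    sample and on its dynamically generated OOD counterpart."""
    correct_original: bool
    correct_ood: bool


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def robustness_report(results: list[PairedResult]) -> dict[str, float]:
    """Compare accuracy on original vs. OOD samples.

    A large positive gap is consistent with (but does not prove)
    memorization or leakage of the original data.
    """
    acc_original = accuracy(r.correct_original for r in results)
    acc_ood = accuracy(r.correct_ood for r in results)
    return {
        "original_acc": acc_original,
        "ood_acc": acc_ood,
        "gap": acc_original - acc_ood,
    }


if __name__ == "__main__":
    # Toy, made-up outcomes purely to exercise the function.
    toy = [
        PairedResult(True, True),
        PairedResult(True, False),
        PairedResult(False, False),
        PairedResult(True, False),
    ]
    print(robustness_report(toy))
```

A model whose accuracy barely changes between the two sets would be called robust under this simplified view, while a sharp drop on OOD samples suggests that its original score may be inflated.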