CL · Feb 22, 2025

ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

arXiv:2502.16268v1 · 17 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This paper addresses data contamination in LLM evaluation, a problem for both researchers and practitioners. The advance is incremental because it builds on existing OOD methods.

The authors tackle the challenge of evaluating large language models (LLMs) robustly by introducing ThinkBench, a framework built on dynamic out-of-distribution (OOD) data generation. Their evaluation of 16 LLMs and 4 process reward models (PRMs) shows that most models are not robust and show signs of data leakage.

Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that the performance of most LLMs is far from robust and that they face a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination.
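The robustness claim in the abstract can be made concrete with a small sketch. The code below is not ThinkBench's method or metric; it is a hypothetical illustration that assumes robustness is gauged by comparing a model's accuracy on original (in-distribution) samples against its accuracy on OOD variants of the same questions. The function names `accuracy` and `robustness_gap` and the toy answers are all invented for this example.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def robustness_gap(id_preds, id_answers, ood_preds, ood_answers):
    """Compare in-distribution (ID) and OOD accuracy (hypothetical measure).

    A large positive gap means the model does much better on the original
    samples than on their OOD variants, which is one possible symptom of
    memorized or leaked answers.
    """
    id_acc = accuracy(id_preds, id_answers)
    ood_acc = accuracy(ood_preds, ood_answers)
    return {"id_accuracy": id_acc, "ood_accuracy": ood_acc, "gap": id_acc - ood_acc}


# Toy data (invented): four multiple-choice questions and their OOD variants.
id_answers = ["A", "B", "C", "A"]
id_preds = ["A", "B", "C", "D"]    # 3 of 4 correct on the original samples
ood_answers = ["A", "B", "C", "A"]
ood_preds = ["A", "C", "C", "B"]   # 2 of 4 correct on the OOD variants

result = robustness_gap(id_preds, id_answers, ood_preds, ood_answers)
print(result)
```

On this toy data the model scores 0.75 on the original samples and 0.5 on the OOD variants, a gap of 0.25. A gap near zero would suggest the model's performance carries over to unseen variants, under the assumptions stated above.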

