LG, AI, CL | May 12

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki, Afshin Cheraghi, Ali Kore, Shayaan Mehdi, Elham Dolatabadi, Arash Afkanpour

arXiv:2605.18824 | 82.4 | Has Code

AI Analysis

For researchers and practitioners evaluating foundation models, this framework provides a way to generate more comprehensive and reliable benchmarks that reveal finer-grained model capabilities.

The paper introduces a framework for automated benchmark generation that produces fine-grained, contamination-robust benchmarks with broad coverage and rich metadata. Expert review shows significantly lower ground-truth error rates than MMLU and GSM8K, and evaluation of 12 models reveals performance differences not captured by existing benchmarks.

Evaluation of foundation models often relies on aggregate scores from benchmarks that lack the comprehensive coverage and metadata needed for fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground-truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.
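To make the contrast between aggregate scores and fine-grained evaluation concrete, here is a minimal sketch of how per-problem metadata could be used. It is not the paper's method or metric: the competency tags, the toy results, and the helper names `coverage_uniformity` and `accuracy_by_tag` are all hypothetical, and normalized Shannon entropy is only one plausible way to quantify "near-uniform" coverage.

```python
import math
from collections import Counter, defaultdict


def coverage_uniformity(tags):
    """Normalized Shannon entropy of the tag distribution (assumed metric).

    Returns a value in [0, 1]; 1.0 means every tag has the same number of problems.
    """
    counts = Counter(tags)
    if len(counts) < 2:
        return 1.0  # a single tag is trivially uniform
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def accuracy_by_tag(results):
    """Group (tag, is_correct) pairs and return {tag: accuracy}."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for tag, ok in results:
        seen[tag] += 1
        correct[tag] += int(ok)
    return {tag: correct[tag] / seen[tag] for tag in seen}


# Hypothetical toy results for one model: (competency tag, answered correctly?)
results = [
    ("regularization", True), ("regularization", False),
    ("optimization", True), ("optimization", True),
    ("evaluation", False), ("evaluation", False),
]

aggregate = sum(ok for _, ok in results) / len(results)
per_tag = accuracy_by_tag(results)
uniformity = coverage_uniformity(tag for tag, _ in results)

print(f"aggregate accuracy: {aggregate:.2f}")
print(f"per-competency accuracy: {per_tag}")
print(f"coverage uniformity: {uniformity:.2f}")
```

In this toy data the aggregate accuracy is 0.50, while the per-competency accuracies range from 0.0 to 1.0, which is the kind of difference a single aggregate score cannot show.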
