CL, Jan 2

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

Berkeley

arXiv:2601.00575v11.11 citations, h-index: 26

Originality: Incremental advance

AI Analysis

This work addresses benchmark contamination and the cost of manually creating benchmarks for LLM evaluation, offering a scalable solution. It is incremental, however, since it builds on existing methods such as genetic algorithms.

The paper tackles the challenge of efficiently creating novel and diverse benchmarks for evaluating large language models (LLMs) in reasoning and code generation. It introduces InfoSynth, a framework that automatically synthesizes Python coding problems, produces accurate test cases and solutions for them 97% of the time, and yields benchmarks with higher novelty and diversity than the seed datasets.

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
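To make the idea of KL-divergence and entropy as novelty and diversity signals concrete, here is a minimal sketch. It is not the paper's estimator, which the abstract does not specify. The sketch assumes each problem has already been assigned a discrete topic label, that novelty is read as the KL divergence of the synthesized label distribution from the seed distribution, and that diversity is read as Shannon entropy. The function names, the toy labels, and the additive smoothing constant `alpha` are all hypothetical.

```python
import math
from collections import Counter


def distribution(labels, support, alpha=1.0):
    """Smoothed categorical distribution over `support` (additive smoothing, an assumption)."""
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {s: (counts[s] + alpha) / total for s in support}


def entropy(p):
    """Shannon entropy in nats; used here as a stand-in for diversity."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; used here as a stand-in for novelty of p relative to q."""
    return sum(v * math.log(v / q[k]) for k, v in p.items() if v > 0)


# Toy topic labels (hypothetical data, not taken from the paper).
seed = ["strings", "strings", "math", "lists"]
synth = ["graphs", "math", "dp", "lists", "strings", "graphs"]

support = sorted(set(seed) | set(synth))
p_seed = distribution(seed, support)
p_synth = distribution(synth, support)

novelty = kl_divergence(p_synth, p_seed)   # larger means further from the seed set
diversity_seed = entropy(p_seed)
diversity_synth = entropy(p_synth)         # larger means more evenly spread topics
```

Smoothing keeps the KL term finite when the synthesized set contains topics that never appear in the seed set. A practical system would more likely work on continuous embeddings than on hand-assigned labels.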
