How to Get Your LLM to Generate Challenging Problems for Evaluation
This work addresses the need for scalable, cost-effective evaluation methods for LLMs, though its contribution is incremental because it builds on existing synthetic generation approaches.
The authors address the problem of evaluating large language models (LLMs) by introducing CHASE, a framework that synthetically generates challenging problems without human involvement. On the resulting benchmarks, state-of-the-art LLMs achieve 40-60% accuracy.
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
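The bottom-up construction with independently verifiable sub-tasks can be illustrated with a small Python sketch. This is not the CHASE implementation: it assumes a toy arithmetic domain, and the names `Step`, `verify_step`, and `build_problem` are hypothetical. Where the framework would rely on LLMs to propose and check components, the sketch substitutes hand-written candidate steps and exact recomputation.

```python
from dataclasses import dataclass
import operator

# Hypothetical toy setup: each step extends a word problem by one operation.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass(frozen=True)
class Step:
    text: str      # sentence appended to the problem statement
    op: str        # one of "+", "-", "*"
    operand: int   # value applied to the running answer
    claimed: int   # answer the (assumed) generator claims after this step


def verify_step(prev_answer: int, step: Step) -> bool:
    """Check one step in isolation by recomputing its claimed answer."""
    return OPS[step.op](prev_answer, step.operand) == step.claimed


def build_problem(seed_text: str, seed_answer: int, candidates: list[Step]):
    """Grow a problem from a simple seed, keeping only verified steps."""
    parts = [seed_text]
    answer = seed_answer
    for step in candidates:
        if verify_step(answer, step):   # rejected steps are simply skipped
            parts.append(step.text)
            answer = step.claimed
    return " ".join(parts), answer


candidates = [
    Step("She buys 4 more.", "+", 4, 7),
    Step("She then doubles her collection.", "*", 2, 15),  # wrong claim, dropped
    Step("She gives away 2.", "-", 2, 5),
]
problem, answer = build_problem("Ana has 3 apples.", 3, candidates)
print(problem)
print(answer)
```

Because every step is checked on its own, a faulty component (the second candidate here) is discarded without invalidating the rest of the problem, and the final answer is correct by construction.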