SE | CL | Jun 13, 2025

Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

arXiv:2506.12278v1 | 5 citations | h-index: 2 | Has Code | ACL
Originality: Synthesis-oriented
AI Analysis

The paper addresses the need for a systematic evaluation of LLM test-case generation for algorithm problems. The contribution is incremental: it builds on existing benchmarks and methods.

The paper evaluates how well LLMs generate high-quality test cases for algorithm problems. It introduces TestCase-Eval, a benchmark of 500 problems and 100,000 human-crafted solutions, and uses it to assess 19 state-of-the-art LLMs on two tasks: fault coverage and fault exposure.

We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes; and (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
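
To make the two tasks concrete, here is a minimal sketch in Python. It is not the paper's scoring code: the abstract does not give exact metric definitions, so the sketch assumes that fault coverage is the fraction of known-incorrect solutions that fail at least one test in a set, and that fault exposure is a single input on which one specific incorrect solution disagrees with a reference solution. The toy problem (sum of 1..n) and all function names are hypothetical.

```python
from typing import Callable, Sequence

# Assumption for illustration: a "solution" maps one integer input to one integer output.
Solution = Callable[[int], int]


def exposes_fault(test_input: int, reference: Solution, candidate: Solution) -> bool:
    """Return True if the candidate crashes or disagrees with the reference on this input."""
    expected = reference(test_input)
    try:
        return candidate(test_input) != expected
    except Exception:
        # A crash counts as a revealed fault in this sketch.
        return True


def fault_coverage(
    test_inputs: Sequence[int],
    reference: Solution,
    incorrect_solutions: Sequence[Solution],
) -> float:
    """Fraction of incorrect solutions that fail on at least one test input (assumed definition)."""
    if not incorrect_solutions:
        return 0.0
    caught = sum(
        1
        for bad in incorrect_solutions
        if any(exposes_fault(t, reference, bad) for t in test_inputs)
    )
    return caught / len(incorrect_solutions)


# Toy problem: return 1 + 2 + ... + n.
def reference_sum(n: int) -> int:
    return n * (n + 1) // 2


def off_by_one(n: int) -> int:
    # Bug: sums 0..n-1 instead of 1..n.
    return sum(range(n))


def overflowing(n: int) -> int:
    # Bug: mimics 32-bit wraparound, so it only fails on large inputs.
    total = n * (n + 1) // 2
    return total if total < 2**31 else total - 2**32


def wrong_on_zero(n: int) -> int:
    # Bug: mishandles the n == 0 edge case.
    return 1 if n == 0 else n * (n + 1) // 2


BUGGY = [off_by_one, overflowing, wrong_on_zero]

if __name__ == "__main__":
    # A narrow test set misses the overflow and edge-case bugs.
    print(fault_coverage([3, 5], reference_sum, BUGGY))
    # A diverse test set catches all three.
    print(fault_coverage([0, 3, 100000], reference_sum, BUGGY))
    # Fault exposure: a tailored input that breaks one specific implementation.
    print(exposes_fault(100000, reference_sum, overflowing))
```

In this sketch, the small inputs 3 and 5 catch only the off-by-one bug, while adding the edge case 0 and the large input 100000 catches all three buggy solutions. This mirrors the abstract's emphasis on probing diverse input scenarios.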

