SE · CL · Jun 13, 2025

Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao

arXiv:2506.12278v1 · 12.66 citations · h-index: 2 · Has Code · ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for systematic evaluation of LLMs in test-case generation for algorithm problems. The contribution is incremental, as it builds on existing benchmarks and methods.

The paper evaluates LLMs' ability to generate high-quality test cases for algorithm problems. It introduces TestCase-Eval, a benchmark with 500 problems and 100,000 human-crafted solutions, and uses it to assess 19 state-of-the-art LLMs on two tasks: fault coverage and fault exposure.

We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
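The two tasks can be illustrated with a small Python sketch. It rests on assumptions that the abstract does not state: a test input "exposes" an incorrect solution when its output differs from a reference solution's output, and fault coverage is the fraction of incorrect solutions exposed by at least one test in a set. The function names (`exposes`, `fault_coverage`) and the toy absolute-value problem are hypothetical, and the benchmark's actual metric definitions may differ.

```python
from typing import Callable, Iterable, List

# Toy stand-in for a submitted program: maps one integer input to one output.
Solution = Callable[[int], int]


def exposes(test_input: int, reference: Solution, candidate: Solution) -> bool:
    """Fault exposure (assumed definition): the input reveals the bug
    if the candidate's output differs from the reference output."""
    return candidate(test_input) != reference(test_input)


def fault_coverage(tests: Iterable[int], reference: Solution,
                   buggy: List[Solution]) -> float:
    """Fault coverage (assumed definition): fraction of incorrect
    solutions exposed by at least one test in the set."""
    tests = list(tests)
    if not buggy:
        return 0.0
    caught = sum(any(exposes(t, reference, b) for t in tests) for b in buggy)
    return caught / len(buggy)


# Toy problem: compute the absolute value of x.
def reference(x: int) -> int:
    return abs(x)


def bug_negatives(x: int) -> int:
    return x  # wrong for x < 0


def bug_zero(x: int) -> int:
    return abs(x) if x != 0 else 1  # wrong only at x == 0


def bug_large(x: int) -> int:
    return abs(x) if abs(x) < 1000 else 0  # wrong for |x| >= 1000


buggy = [bug_negatives, bug_zero, bug_large]

# A test set of small positive inputs catches none of the bugs.
print(fault_coverage([1, 2, 3], reference, buggy))
# Adding a negative input and zero catches two of the three bugs.
print(fault_coverage([-1, 0], reference, buggy))
# A tailored large input exposes the remaining bug.
print(exposes(5000, reference, bug_large))
```

In this toy setting, a test set scores well on coverage only if it probes distinct input regions (negatives, zero, large magnitudes), while exposure asks for a single input targeted at one known-bad implementation.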
