CL · May 30, 2025

HardTests: Synthesizing High-Quality Test Cases for LLM Coding

Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li

arXiv:2505.24098v118.820 citations · h-index: 4 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the problem of reliable verification for LLM coding tasks, particularly in competitive programming. The advance is incremental because it builds on existing test synthesis approaches.

The paper tackles the challenge of synthesizing high-quality test cases for verifying LLM-generated code in competitive programming. The result is a dataset of 47k problems with synthetic tests that, on harder problems, improve precision over existing tests by up to 40 percentage points.

Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
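The precision and recall figures above treat a test suite as a binary classifier over candidate programs. The following is a minimal sketch of how such metrics could be computed, assuming precision is the fraction of test-passing programs that are truly correct and recall is the fraction of truly correct programs that pass the tests. The function name, the input format, and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple


def verifier_precision_recall(
    results: Iterable[Tuple[bool, bool]],
) -> Tuple[float, float]:
    """Score a test suite as a classifier of program correctness.

    Each item is (passed_tests, truly_correct) for one candidate program.
    This input format is an assumption made for illustration.
    """
    tp = fp = fn = 0
    for passed, correct in results:
        if passed and correct:
            tp += 1  # correct program accepted by the tests
        elif passed and not correct:
            fp += 1  # wrong program that slipped past the tests
        elif not passed and correct:
            fn += 1  # correct program wrongly rejected by the tests
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical outcomes for five candidate programs.
sample = [(True, True), (True, False), (False, True), (True, True), (False, False)]
precision, recall = verifier_precision_recall(sample)
```

Under this reading, weak tests lower precision because wrong programs pass them, while invalid or overly strict tests lower recall because correct programs fail them.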
