SE, AI, CL, LG | Mar 30, 2025

Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces

arXiv:2503.23466v1 | 1 citation | h-index: 5 | ICST
Originality: Synthesis-oriented
AI Analysis

This dataset addresses the need for error-inducing test cases in software testing, particularly for AI-generated code, but is incremental as it compiles existing data from a public platform.

The authors tackled the problem of false negatives in software testing by curating Codehacks, a dataset of 288,617 adversarial test cases for 5,578 programming problems from Codeforces, to support testing of software synthesized from large language models.

Software is used in critical applications in our day-to-day life, and it is important to ensure its correctness. One popular approach to assessing correctness is to evaluate software on tests. If a test fails, it indicates a fault in the software under test; if all tests pass, one may assume that the software is correct. However, the reliability of these results depends on the test suite considered, and there is a risk of false negatives (i.e., software that passes all available tests but contains bugs because some cases are not tested). Therefore, it is important to consider error-inducing test cases when evaluating software. To support data-driven creation of such a test suite, which is especially of interest for testing software synthesized from large language models, we curate a dataset (Codehacks) of programming problems together with corresponding error-inducing test cases (i.e., "hacks"). This dataset is collected from the wild, in particular from the Codeforces online judge platform. The dataset comprises 288,617 hacks for 5,578 programming problems, each with a natural language description, as well as the source code for 2,196 submitted solutions to these problems that can be broken with their corresponding hacks.

Keywords: competitive programming, language model, dataset
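The false-negative risk described in the abstract can be illustrated with a toy example. The sketch below is not taken from the dataset: the problem (return the maximum of a list), the function names `buggy_max` and `failing_tests`, and the test data are all hypothetical, chosen only to show how a solution can pass an ordinary test suite yet be broken by a single adversarial test, or "hack".

```python
from typing import Callable, List, Tuple

# A test case is a pair (input, expected output). This layout is an
# assumption for illustration, not the Codehacks schema.
TestCase = Tuple[List[int], int]


def buggy_max(values: List[int]) -> int:
    """Hypothetical submission: meant to return the maximum of a list."""
    best = 0  # bug: silently assumes the answer is non-negative
    for v in values:
        if v > best:
            best = v
    return best


def failing_tests(
    solution: Callable[[List[int]], int], tests: List[TestCase]
) -> List[TestCase]:
    """Return every test case on which the solution gives a wrong answer."""
    return [(inp, expected) for inp, expected in tests if solution(inp) != expected]


# An ordinary suite that never exercises negative-only inputs.
weak_suite: List[TestCase] = [([1, 2, 3], 3), ([7], 7), ([4, 0, 2], 4)]

# An adversarial test ("hack") targeting the non-negative assumption.
hacks: List[TestCase] = [([-5, -2], -2)]

print(failing_tests(buggy_max, weak_suite))  # no failures: a false negative
print(failing_tests(buggy_max, hacks))       # the hack exposes the bug
```

With only `weak_suite`, the submission would be judged correct; adding the hack reveals the fault. This is the role error-inducing tests play when evaluating synthesized code.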
