Tricky$^2$: Towards a Benchmark for Evaluating Human and LLM Error Interactions
The paper provides a benchmark for studying error interactions in hybrid human-machine code. The contribution is incremental, since it builds on the existing TrickyBugs corpus.
The authors address the problem of evaluating how human and LLM errors interact in software development. They construct Tricky$^2$, a hybrid dataset that augments human-written defects with LLM-injected errors across multiple programming languages, enabling analysis of mixed-origin error behavior and multi-bug repair robustness.
Large language models (LLMs) are increasingly integrated into software development workflows, yet they often introduce subtle logic or data-misuse errors that differ from human bugs. To study how these two error types interact, we construct Tricky$^2$, a hybrid dataset that augments the existing TrickyBugs corpus of human-written defects with errors injected by both GPT-5 and OpenAI-oss-20b across C++, Python, and Java programs. Our approach uses a taxonomy-guided prompting framework to generate machine-originated bugs while preserving original human defects and program structure. The resulting corpus spans human-only, LLM-only, and human+LLM splits, enabling analysis of mixed-origin error behavior, multi-bug repair robustness, and reliability in hybrid human-machine code. This paper outlines the dataset construction pipeline and illustrates its use through small-scale baseline evaluations of classification, localization, and repair tasks.
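To make the three-way split concrete, the following is a minimal sketch of how a sample could be assigned to the human-only, LLM-only, or human+LLM split. The record layout (`Sample`, `program_id`, `has_human_bug`, `has_llm_bug`) and the helper `split_of` are hypothetical names chosen for illustration; the paper summary above does not specify the dataset schema.

```python
from collections import Counter
from dataclasses import dataclass

# Languages named in the abstract; the short codes are an assumption.
LANGUAGES = {"cpp", "python", "java"}


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one buggy program (not the paper's schema)."""
    program_id: str       # illustrative identifier
    language: str         # one of LANGUAGES
    has_human_bug: bool   # original human-written defect is present
    has_llm_bug: bool     # an LLM-injected error is present


def split_of(sample: Sample) -> str:
    """Map a sample to one of the three splits described in the abstract."""
    if sample.language not in LANGUAGES:
        raise ValueError(f"unsupported language: {sample.language}")
    if sample.has_human_bug and sample.has_llm_bug:
        return "human+LLM"
    if sample.has_human_bug:
        return "human-only"
    if sample.has_llm_bug:
        return "LLM-only"
    raise ValueError("sample contains no bug and belongs to no split")


# Toy data, invented purely to exercise the function.
samples = [
    Sample("p1", "python", has_human_bug=True, has_llm_bug=False),
    Sample("p2", "cpp", has_human_bug=False, has_llm_bug=True),
    Sample("p3", "java", has_human_bug=True, has_llm_bug=True),
]
split_counts = Counter(split_of(s) for s in samples)
```

Keeping bug origin as two independent flags, rather than a single label, makes the mixed-origin case an explicit combination of the other two, which is the property the human+LLM split is meant to expose.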