SE · CL · Apr 19

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Wang Bill Zhu, Miaosen Chai, Shangshang Wang, Yejia Liu, Song Bian, Honghua Dong, Willie Neiswanger, Robin Jia

arXiv:2604.17338 · 88.2 · h-index: 5

Predicted impact: top 10% in SE · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and developers of LLM-based coding assistants, this benchmark reveals that current models lack precise debugging capability, highlighting a critical bottleneck in code repair tasks.

The paper introduces the Precise Debugging Benchmark (PDB) to evaluate whether LLMs perform precise debugging or merely regenerate code. Frontier models like GPT-5.1-Codex and DeepSeek-V3.2-Thinking achieve >76% unit-test pass rates but exhibit <45% precision, even with minimal debugging instructions, and iterative strategies fail to improve precision or recall.

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved, respectively. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
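To make the two metrics concrete, here is a minimal Python sketch of one plausible reading of them. It is not the paper's implementation: the abstract does not give exact formulas, so the line-level diff, the function names (`changed_lines`, `edit_level_precision`, `bug_level_recall`), and the treatment of insertions are all assumptions made for illustration. Precision is taken as the fraction of touched lines that actually needed a change, and recall as the fraction of injected bugs that were resolved.

```python
from difflib import SequenceMatcher


def changed_lines(buggy: list[str], fixed: list[str]) -> set[int]:
    """Return 0-based indices of buggy-program lines that the fix touched."""
    touched: set[int] = set()
    matcher = SequenceMatcher(a=buggy, b=fixed, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            touched.update(range(i1, i2))
        elif tag == "insert":
            # Assumption: attribute a pure insertion to the line it precedes.
            touched.add(i1)
    return touched


def edit_level_precision(edited: set[int], necessary: set[int]) -> float:
    """Fraction of edited lines that were necessary (assumed definition)."""
    if not edited:
        return 0.0
    return len(edited & necessary) / len(edited)


def bug_level_recall(bug_resolved: list[bool]) -> float:
    """Fraction of injected bugs that the fix resolved (assumed definition)."""
    if not bug_resolved:
        return 0.0
    return sum(bug_resolved) / len(bug_resolved)


# Hypothetical example: one injected bug on line index 1.
buggy = ["def add(a, b):", "    return a - b"]
necessary = {1}

precise_fix = ["def add(a, b):", "    return a + b"]
over_edit = ["def add(x, y):", "    return x + y"]  # correct, but rewrites everything

print(edit_level_precision(changed_lines(buggy, precise_fix), necessary))  # 1.0
print(edit_level_precision(changed_lines(buggy, over_edit), necessary))    # 0.5
```

Both fixes would pass the same unit tests, yet the regenerated version scores only 0.5 on precision under this sketch. That is the kind of gap between pass rate and precision the abstract reports.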
