SE · AI · CL · Jan 9, 2024

DebugBench: Evaluating Debugging Capability of Large Language Models

Tencent
arXiv:2401.04621v3 · 95 citations · h-index: 41 · Has Code · ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for a robust benchmark to assess LLM debugging performance, which matters to both developers and researchers. It is incremental, however: it builds on existing evaluation methods by scaling up the dataset and diversifying the bug types.

The paper tackles the evaluation of debugging capability in large language models (LLMs), an area less explored than code generation. It introduces DebugBench, a benchmark of 4,253 instances covering multiple bug types and languages. It finds that closed-source models debug worse than humans and that open-source models achieve lower pass rates still.

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, debugging, another critical component of programming proficiency, remains relatively unexplored for LLMs. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce "DebugBench", an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into the source data with GPT-4, and apply rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models attain relatively lower pass rates; (2) the complexity of debugging fluctuates notably depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance, but that impact is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.
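The pass rate mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation harness. It assumes a repaired program counts as passing only when every test case for its instance succeeds, and the function names, the input layout, and the category labels ("syntax", "logic") are hypothetical placeholders.

```python
from collections import defaultdict


def instance_passes(test_outcomes):
    """Assumed criterion: a fix passes only if it has tests and all of them succeed."""
    return bool(test_outcomes) and all(test_outcomes)


def pass_rate_by_category(results):
    """Aggregate pass rates per bug category.

    results: iterable of (category, test_outcomes) pairs, where test_outcomes
    is a list of booleans, one per test case (hypothetical layout).
    Returns a dict mapping each category to the fraction of instances that passed.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, test_outcomes in results:
        totals[category] += 1
        if instance_passes(test_outcomes):
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


# Made-up outcomes for four instances; these are not data from the paper.
sample = [
    ("syntax", [True, True]),
    ("syntax", [True, True, True]),
    ("logic", [True, False]),
    ("logic", [True, True]),
]
rates = pass_rate_by_category(sample)
```

Grouping by category in this way is what lets a comparison such as finding (2) be read off directly, since each bug category gets its own rate instead of being averaged into a single score.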

Code Implementations: 1 repo
