SE AI CL | Jan 9, 2024

DebugBench: Evaluating Debugging Capability of Large Language Models

Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun

Tencent

arXiv:2401.04621v3 | 31.697 citations | h-index: 41 | Has Code | ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for a robust benchmark for assessing LLM debugging performance, which matters to both developers and researchers. Its contribution is incremental: it builds on existing evaluation methods by scaling up the dataset and diversifying the bug types.

The paper tackles the evaluation of the debugging capability of large language models (LLMs), which is under-explored compared to their coding capability. It introduces DebugBench, a benchmark of 4,253 instances covering multiple bug types and languages. It finds that closed-source models perform worse than humans at debugging, and that open-source models achieve lower pass rates still.

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce 'DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into the source data with GPT-4, and conduct rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models achieve relatively lower pass rates; (2) the complexity of debugging fluctuates notably depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance, but it is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.
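The abstract reports results as pass rates that vary by bug category. As a minimal sketch, assuming an instance counts as passed only when the model's fix passes every test case (an assumption, since the abstract does not define the metric), a per-category pass rate could be computed as below. The function name, the record layout, and the category labels "A" and "B" are hypothetical, and the sample outcomes are illustrative, not data from the paper.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Return {category: fraction of instances passed}.

    `results` is an iterable of (category, passed) pairs, where `passed`
    is True when the repaired code passed all of its test cases.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if passed:
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


# Hypothetical outcomes with placeholder category labels (not from the paper).
example = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False),
]
rates = pass_rate_by_category(example)
```

Grouping by category before dividing keeps a large category from masking a low pass rate in a small one, which is the kind of per-category fluctuation the abstract describes.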
