SE · AI · Sep 4, 2025

RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models

Jingjing Liu, Zeming Liu, Zihao Cheng, Mengliang He, Xiaoming Shi, Yuhang Guo, Xiangrong Zhu, Yuanfang Guo, Yunhong Wang, Haifeng Wang

arXiv:2509.04078v2 · 3 citations · h-index: 7 · EMNLP

Originality: Incremental advance

AI Analysis

RepoDebug addresses a gap in benchmarking LLMs on real-world software debugging, though the advance is incremental: it builds on existing repository-level datasets by expanding their diversity.

The paper tackles the evaluation of large language models (LLMs) on repository-level code debugging, which is more complex and realistic than function-level scenarios. It introduces RepoDebug, a multi-task, multi-language dataset covering 22 error subtypes across 8 languages and 3 tasks, and finds that even the best-performing model, Claude 3.5 Sonnet, performs poorly in this setting.

Large Language Models (LLMs) have exhibited significant proficiency in code debugging, especially in automatic program repair, which may substantially reduce the time developers spend on debugging and enhance their efficiency. Significant advancements in debugging datasets have been made to promote the development of code debugging. However, these datasets primarily focus on assessing the LLM's function-level code repair capabilities, neglecting the more complex and realistic repository-level scenarios, which leads to an incomplete understanding of the LLM's challenges in repository-level debugging. While several repository-level datasets have been proposed, they often suffer from limitations such as limited diversity of tasks, languages, and error types. To mitigate this challenge, this paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset with 22 subtypes of errors that supports 8 commonly used programming languages and 3 debugging tasks. Furthermore, we conduct evaluation experiments on 10 LLMs, where Claude 3.5 Sonnet, the best-performing model, still cannot perform well in repository-level debugging.
