SE | AI | Jun 15, 2025

MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios

Jinyang Huang, Xiachong Feng, Qiguang Chen, Hanjie Zhao, Zihui Cheng, Jiesong Bai, Jingxuan Zhou, Min Li, Libo Qin

arXiv:2506.13824v1 | 11.34 citations | h-index: 20 | Has Code | ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in software engineering for developers working with complex real-world code, though it is incremental in that it extends existing debugging benchmarks to multi-library settings.

The authors address the lack of benchmarks for code debugging in multi-library Python scenarios by introducing MLDebugging, a benchmark covering 126 libraries and seven issue types, and find that current LLMs struggle with this task.

Code debugging is a crucial task in software engineering that is attracting increasing attention. While remarkable progress has been made in the era of large language models (LLMs), current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenarios found in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries and covers a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation on MLDebugging using both mainstream open-source and closed-source LLMs and show that current LLMs still struggle to debug code correctly in multi-library scenarios. We hope this work can uncover the potential of LLMs in multi-library debugging scenarios and offer insights for future research.

