MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios
The work addresses a gap for developers who debug complex real-world code, though the contribution is incremental: it extends existing debugging benchmarks to multi-library settings.
The authors tackled the lack of benchmarks for code debugging in multi-library Python scenarios by introducing MLDebugging, a comprehensive benchmark covering 126 libraries and seven issue types, and found that current LLMs struggle with this task.
Code debugging is a crucial task in software engineering that is attracting increasing attention. While remarkable progress has been made in the era of large language models (LLMs), current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenarios found in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries and covers a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation on MLDebugging using both mainstream open-source and closed-source LLMs and find that current LLMs still struggle to debug code correctly in multi-library scenarios. We hope this work can uncover the potential of LLMs in multi-library debugging scenarios and offer insights for future research.
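To make the multi-library setting concrete, the sketch below shows the kind of bug that arises only where two libraries meet. It is an illustrative example, not an item from MLDebugging: the function names `buggy_serialize`, `fixed_serialize`, and `passes_test` are hypothetical, and the pass/fail check is an assumed stand-in for test-based scoring rather than the paper's evaluation protocol. Two standard-library modules are used so the snippet runs without extra dependencies.

```python
import json
from datetime import datetime, timezone


def buggy_serialize(event_name: str, when: datetime) -> str:
    # Bug at the library boundary: json cannot encode datetime objects,
    # so json.dumps raises TypeError here.
    return json.dumps({"event": event_name, "time": when})


def fixed_serialize(event_name: str, when: datetime) -> str:
    # Fix: convert the datetime to an ISO 8601 string before encoding.
    return json.dumps({"event": event_name, "time": when.isoformat()})


def passes_test(candidate) -> bool:
    """Hypothetical check: does the candidate produce the expected payload?"""
    when = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    try:
        payload = json.loads(candidate("deploy", when))
    except Exception:
        # Any exception counts as a failed debugging attempt.
        return False
    return payload == {"event": "deploy", "time": "2024-01-02T03:04:05+00:00"}


if __name__ == "__main__":
    print("buggy passes:", passes_test(buggy_serialize))
    print("fixed passes:", passes_test(fixed_serialize))
```

Neither library is used incorrectly in isolation; the failure appears only when a value produced by one is handed to the other, which is the class of issue a no-library or single-library benchmark would not exercise.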