SE AI CL Aug 26, 2024

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng

arXiv:2408.14354v1 18.044 citations h-index: 44

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for broader programming language support in automated software engineering benchmarks. The contribution is incremental: it extends an existing benchmark to Java.

The paper tackles the lack of a multilingual benchmark for evaluating large language models on GitHub issue resolving. It develops SWE-bench-java, a Java version of the existing Python benchmark, releases it with an evaluation environment and leaderboard, and tests it with SWE-agent and several LLMs.

GitHub issue resolving is a critical task in software engineering that has recently gained significant attention in both industry and academia. For this task, SWE-bench was released to evaluate the issue-resolving capabilities of large language models (LLMs), but it has so far covered only Python. Supporting more programming languages is also important, as there is strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. Developing a high-quality multilingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
