CLJul 27, 2023
PanGu-Coder2: Boosting Large Language Models for Code with Ranking FeedbackBo Shen, Jiaxin Zhang, Taihong Chen et al.
Large Language Models for Code (Code LLM) are flourishing. New and powerful models are released on a weekly basis, demonstrating remarkable performance on the code generation task. Various approaches have been proposed to boost the code generation performance of pre-trained Code LLMs, such as supervised fine-tuning, instruction tuning, reinforcement learning, etc. In this paper, we propose a novel RRTF (Rank Responses to align Test&Teacher Feedback) framework, which can effectively and efficiently boost pre-trained large language models for code generation. Under this framework, we present PanGu-Coder2, which achieves 62.20% pass@1 on the OpenAI HumanEval benchmark. Furthermore, through an extensive evaluation on CoderEval and LeetCode benchmarks, we show that PanGu-Coder2 consistently outperforms all previous Code LLMs.
SEAug 26, 2024
SWE-bench-java: A GitHub Issue Resolving Benchmark for JavaDaoguang Zan, Zhirong Huang, Ailun Yu et al.
GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate issue resolving capabilities of large language models (LLMs), but has so far only focused on Python version. However, supporting more programming languages is also important, as there is a strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method SWE-agent and test several powerful LLMs on it. As is well known, developing a high-quality multi-lingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
CLJun 3, 2024
CodeR: Issue Resolving with Multi-Agent and Task GraphsDong Chen, Shaoxin Lin, Muhan Zeng et al.
GitHub issue resolving recently has attracted significant attention from academia and industry. SWE-bench is proposed to measure the performance in resolving issues. In this paper, we propose CodeR, which adopts a multi-agent framework and pre-defined task graphs to Repair & Resolve reported bugs and add new features within code Repository. On SWE-bench lite, CodeR is able to solve 28.33% of issues, when submitting only once for each issue. We examine the performance impact of each design of CodeR and offer insights to advance this research direction.
SEJun 28, 2017
Identifying Patch Correctness in Test-Based Program RepairYingfei Xiong, Xinyuan Liu, Muhan Zeng et al.
Test-based automatic program repair has attracted a lot of attention in recent years. However, the test suites in practice are often too weak to guarantee correctness and existing approaches often generate a large number of incorrect patches. To reduce the number of incorrect patches generated, we propose a novel approach that heuristically determines the correctness of the generated patches. The core idea is to exploit the behavior similarity of test case executions. The passing tests on original and patched programs are likely to behave similarly while the failing tests on original and patched programs are likely to behave differently. Also, if two tests exhibit similar runtime behavior, the two tests are likely to have the same test results. Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness. Our approach is evaluated on a dataset consisting of 139 patches generated from existing program repair systems including jGenProg, Nopol, jKali, ACS and HDRepair. Our approach successfully prevented 56.3\% of the incorrect patches to be generated, without blocking any correct patches.