CL · AI · SE · Oct 10, 2023

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Princeton · UW
arXiv:2310.06770v3 · 2478 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of assessing language model capabilities for practical, autonomous software engineering applications. It represents an incremental advance in evaluation methodology.

The paper tackles the problem of evaluating language models on real-world software engineering tasks by introducing SWE-bench, a framework of 2,294 problems drawn from GitHub issues and pull requests across 12 Python repositories. It finds that even the best-performing model, Claude 2, solves only 1.96% of these issues.

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
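To put the headline figure in concrete terms, the sketch below computes a resolved percentage over per-issue outcomes and backs out the whole number of issues implied by the reported 1.96%. It assumes the percentage is taken over all 2,294 issues. The `IssueOutcome` record, its fields, and both helper functions are hypothetical names chosen for illustration and are not part of the benchmark's own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueOutcome:
    """Hypothetical record: one benchmark issue and whether the model's edit resolved it."""
    repo: str
    issue_id: str
    resolved: bool


def resolved_percentage(outcomes):
    """Return the percentage of issues marked resolved (0.0 for an empty list)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(o.resolved for o in outcomes) / len(outcomes)


def implied_resolved_count(total_issues, percentage):
    """Back out the nearest whole number of resolved issues from a reported percentage."""
    return round(total_issues * percentage / 100.0)


if __name__ == "__main__":
    # Assumption: 1.96% is measured over the full set of 2,294 issues.
    count = implied_resolved_count(2294, 1.96)
    # Placeholder outcomes: the first `count` issues are marked resolved.
    outcomes = [
        IssueOutcome("example/repo", f"issue-{i}", i < count) for i in range(2294)
    ]
    print(count, round(resolved_percentage(outcomes), 2))
```

Under that assumption, 1.96% of 2,294 works out to about 45 resolved issues, which makes clear how far the best model is from handling the benchmark as a whole.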

Code Implementations: 8 repos

Data from Papers with Code (CC-BY-SA-4.0)

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
