SE · CL · Mar 9, 2025

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

arXiv:2503.06680v2 · 44 citations · h-index: 7 · ACL
Originality: Incremental advance
AI Analysis

This work addresses developers' need for better evaluation of LLMs in automated software engineering, though it is an incremental advance because it builds on existing benchmarking efforts.

The authors tackle the problem of evaluating large language models (LLMs) on repository-level code generation for feature implementation. They introduce FEA-Bench, a benchmark built from pull requests in 83 GitHub repositories, and find that LLMs perform significantly worse on it, which highlights the difficulty of incremental development inside an existing codebase.

Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. The feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts in the code repository, providing a more comprehensive evaluation method of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse in the FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
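
The rule-based part of the filtering described in the abstract might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' pipeline: the `PullRequest` class, its fields, and the specific rules (the pull request touches at least one test file, at least one non-test file, and adds at least one new function or class) are hypothetical, and the intent-based filtering step is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a pull request (not the paper's schema)."""
    title: str
    changed_files: list[str]
    # Names of functions/classes newly added by the PR (assumed to be pre-extracted).
    added_definitions: list[str] = field(default_factory=list)


def is_test_file(path: str) -> bool:
    """Heuristic: treat common Python test naming/layout conventions as tests."""
    name = path.rsplit("/", 1)[-1]
    in_tests_dir = "/tests/" in f"/{path}"
    return name.startswith("test_") or name.endswith("_test.py") or in_tests_dir


def is_feature_candidate(pr: PullRequest) -> bool:
    """Assumed rules: the PR changes tests and source, and adds a new component."""
    has_tests = any(is_test_file(p) for p in pr.changed_files)
    has_source = any(not is_test_file(p) for p in pr.changed_files)
    adds_component = len(pr.added_definitions) > 0
    return has_tests and has_source and adds_component


def select_candidates(prs: list[PullRequest]) -> list[PullRequest]:
    """Keep only the PRs that pass the rule-based checks above."""
    return [pr for pr in prs if is_feature_candidate(pr)]
```

Under these assumed rules, a pull request that adds `export_csv` in `pkg/export.py` together with `tests/test_export.py` would be kept, while one that only edits existing code, or that ships no tests, would be dropped.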

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
