SWE-Bench-CL: Continual Learning for Coding Agents
This provides a tool for developing adaptive AI agents in software engineering, but it is incremental as it builds on existing datasets and methods.
The paper tackles the problem of evaluating continual learning in coding agents by introducing SWE-Bench-CL, a benchmark based on chronologically ordered GitHub issues, and reports results including specialized metrics like average accuracy and forgetting to assess agent performance.
Large Language Models (LLMs) have achieved impressive results on static code-generation benchmarks, but real-world software development unfolds as a continuous stream of evolving issues, fixes, and feature requests. We introduce SWE-Bench-CL, a novel continual learning benchmark built on the human-verified SWE-Bench Verified dataset introduced by OpenAI and Princeton-NLP in 2024. By organizing GitHub issues into chronologically ordered sequences that reflect natural repository evolution, SWE-Bench-CL enables direct evaluation of an agent's ability to accumulate experience, transfer knowledge across tasks, and resist catastrophic forgetting. We complement the dataset with (i) a preliminary analysis of inter-task structural similarity and contextual sensitivity, (ii) an interactive LangGraph-based evaluation framework augmented with a FAISS-backed semantic memory module, and (iii) a suite of specialized continual learning metrics -- including average accuracy, forgetting, forward/backward transfer, tool-use efficiency, and a generalized Composite Continual Learning Score and CL-F-beta score -- to capture the stability-plasticity trade-off. We outline a rigorous experimental protocol comparing memory-enabled and memory-disabled agents across diverse Python repositories. All code and data are publicly available at https://github.com/thomasjoshi/agents-never-forget, providing the community with a reproducible platform for developing more adaptive and robust AI agents in software engineering.