SE | AI | May 28, 2025

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

arXiv:2505.22583v1 | 2 citations | h-index: 11 | Has Code | Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in evaluating AI agents on software engineering workflows, specifically version control, but it is incremental in that it builds on existing benchmarks such as SWE-bench.

The authors address the lack of benchmarks for AI agents on version control system tasks by introducing GitGoodBench, a benchmark covering three core Git scenarios, and report a 21.11% overall solve rate for GPT-4o equipped with custom tools on the prototyping version.

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.

Code Implementations: 1 repo