SE AI | May 28, 2025

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov

arXiv:2505.22583v1 | 11.32 citations | h-index: 11 | Has Code | Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)

Originality: Synthesis-oriented

AI Analysis

GitGoodBench addresses a gap in the evaluation of AI agents on software engineering workflows, specifically version control. The contribution is incremental, since it builds on existing benchmarks such as SWE-bench.

The authors address the lack of benchmarks for AI agents on version control system tasks by introducing GitGoodBench, a benchmark covering three core Git scenarios. Their baseline, GPT-4o equipped with custom tools, achieves a 21.11% overall solve rate on the 120-sample prototyping version.

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.
