SE, AI, PL | Jul 16, 2025

GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities

MILA
arXiv:2507.12367v2 | 5 citations | h-index: 26 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a critical issue for developers and users of AI coding tools: it provides an execution-based benchmark for measuring how well code generation adapts to library versions. The contribution is incremental, since it builds on existing evaluation methods.

The paper tackles the problem of AI code generation struggling with Python library version incompatibilities by introducing GitChameleon 2.0, a dataset with 328 version-conditioned code completion problems and executable tests, finding that state-of-the-art models achieve only 48-51% success rates.

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task: enterprise models achieve baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
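
To make "execution-based evaluation" concrete, the following is a minimal sketch of how a candidate solution might be scored against a problem's unit tests. It is not the benchmark's actual harness: the `Problem` record, its field names, and the `passes` and `success_rate` helpers are all hypothetical, and the toy problem uses no real library. A real harness would presumably install the pinned library version in an isolated environment before running anything.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical record; field names are illustrative, not the dataset schema."""
    library: str   # library the problem is conditioned on
    version: str   # pinned version the solution must be compatible with
    test_src: str  # assertions that exercise the candidate solution


def passes(problem: Problem, candidate_src: str) -> bool:
    """Run the candidate, then the tests, in a fresh namespace.

    A problem counts as solved only if neither step raises an exception.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)
        exec(problem.test_src, namespace)
    except Exception:
        return False
    return True


def success_rate(results: list[bool]) -> float:
    """Fraction of problems whose tests passed."""
    return sum(results) / len(results) if results else 0.0


# Toy problem (assumption: no real library is installed or imported here).
toy = Problem(
    library="example-lib",  # hypothetical name
    version="1.0.0",        # hypothetical version
    test_src="assert add_one(2) == 3",
)
good = "def add_one(x):\n    return x + 1\n"
bad = "def add_one(x):\n    return x + 2\n"

results = [passes(toy, good), passes(toy, bad)]
print(success_rate(results))
```

The design point is that a solution is judged by running it, not by comparing its text to a reference, so code that looks plausible but calls an API absent from the pinned version fails.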

