IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks
This work addresses the need for realistic evaluation of AI agents in software development and provides a benchmark for developers and researchers. Its contribution is incremental, since it builds on existing IDE agent concepts.
The authors evaluate large language models as IDE agents on real-world software engineering tasks by introducing IDE-Bench, a framework built on a Dockerized test harness with structured tools. It assesses agents across 80 tasks in eight unpublished repositories and systematically correlates agent-reported intent with successful project-level modifications.
IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem modeled on AI-native IDEs like Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and tools for testing full-stack applications, IDE-Bench evaluates an agent's ability to act as a true engineering collaborator. For evaluation and to prevent training data contamination, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks. The tasks represent production scenarios on modern tech stacks, including feature implementation, bug fixing, refactoring, and performance optimization, and mirror daily developer workflows in private codebases. Our benchmark is the first to systematically correlate agent-reported intent with successful project-level modifications in a multi-language, full-stack environment on completely uncontaminated code.
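To make the idea of a structured tool interface concrete, the following is a minimal sketch of how codebase search and structured file editing might be exposed to an agent as named tools instead of raw shell commands. It is not IDE-Bench's actual API: the names `ToolResult`, `search_codebase`, `edit_file`, `dispatch`, and the `{"tool": ..., "args": ...}` call format are all assumptions made for illustration, and the Docker layer is omitted.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class ToolResult:
    """Hypothetical result type: success flag plus text returned to the agent."""
    ok: bool
    output: str


def search_codebase(root: Path, query: str) -> ToolResult:
    """Return 'relative/path:line_no: text' for every line containing query."""
    hits: List[str] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for line_no, text in enumerate(lines, start=1):
            if query in text:
                rel = path.relative_to(root).as_posix()
                hits.append(f"{rel}:{line_no}: {text.strip()}")
    return ToolResult(ok=True, output="\n".join(hits))


def edit_file(root: Path, path: str, old: str, new: str) -> ToolResult:
    """Replace exactly one occurrence of old with new; refuse ambiguous edits."""
    target = (root / path).resolve()
    if root.resolve() not in target.parents:
        return ToolResult(False, "path escapes the workspace")
    if not target.is_file():
        return ToolResult(False, "file not found")
    content = target.read_text(encoding="utf-8")
    count = content.count(old)
    if count != 1:
        return ToolResult(False, f"expected 1 match, found {count}")
    target.write_text(content.replace(old, new), encoding="utf-8")
    return ToolResult(True, f"edited {path}")


# Hypothetical registry mapping tool names to implementations.
TOOLS: Dict[str, Callable[..., ToolResult]] = {
    "search_codebase": search_codebase,
    "edit_file": edit_file,
}


def dispatch(root: Path, call: dict) -> ToolResult:
    """Route a structured tool call of the form {'tool': name, 'args': {...}}."""
    name = call.get("tool", "")
    tool = TOOLS.get(name)
    if tool is None:
        return ToolResult(False, f"unknown tool: {name}")
    return tool(root, **call.get("args", {}))
```

Under these assumptions, a harness would call `dispatch(workspace, {"tool": "edit_file", "args": {...}})` for each agent action and log the returned `ToolResult`. Requiring a unique match in `edit_file` is one possible design choice: it turns an ambiguous edit into an explicit failure that the agent can observe and retry, which keeps the recorded intent and the resulting file change easy to compare.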