AI DC MA

Feb 7, 2025

ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

Saurabh Jha, Rohan Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano

IBM

arXiv:2502.05352v1

27.939 citations

h-index: 35

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for systematic evaluation of AI agents in IT automation for researchers and practitioners, though it is incremental as it builds on existing benchmarking concepts.

The authors address the evaluation of AI agents for IT automation by introducing ITBench, a benchmarking framework with 94 real-world scenarios across SRE, CISO, and FinOps. They find that agents powered by state-of-the-art models resolve only 13.8% of SRE, 25.2% of CISO, and 0% of FinOps scenarios.

Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.
