AI · May 11

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Harsh Raj, Niranjan Orkat, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn, Subhabrata Majumdar

arXiv:2605.10516 · 65.1

Predicted impact: top 57% in AI · last 90 days
Originality: Incremental advance

AI Analysis

The paper provides a rigorous measurement science for evaluating agent robustness, which is crucial for high-stakes deployment.

This paper introduces a statistical framework for quantifying AI agent reliability using U-statistics and kernel-based metrics, demonstrating that trajectory-level consistency offers greater diagnostic sensitivity than pass@1 rates across three benchmarks.

This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
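The abstract does not give the paper's exact estimators, so the following is only a minimal sketch of the general idea: an order-2 U-statistic averages a symmetric kernel over all pairs of runs collected under perturbed versions of the same task. The function names, the exact-match kernel, the Jaccard kernel over tool names, and the sample data are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations


def u_statistic(samples, kernel):
    """Order-2 U-statistic: mean of a symmetric kernel over all unordered pairs."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def exact_match(a, b):
    """Hypothetical output-level kernel: 1.0 if two final answers agree."""
    return 1.0 if a == b else 0.0


def jaccard(a, b):
    """Hypothetical trajectory-level kernel: overlap of the sets of steps used."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def pass_rate(outputs, expected):
    """Fraction of runs whose final answer equals the expected one."""
    return sum(o == expected for o in outputs) / len(outputs)


# Illustrative data only: final answers and tool-call traces from
# runs of one task under semantically equivalent rephrasings.
answers = ["42", "42", "41", "42"]
trajectories = [
    ["search", "read", "answer"],
    ["search", "read", "answer"],
    ["search", "answer"],
]

output_consistency = u_statistic(answers, exact_match)
trajectory_consistency = u_statistic(trajectories, jaccard)
accuracy = pass_rate(answers, "42")
```

In this toy data the pass rate and the pairwise output agreement differ, which illustrates why a consistency measure can carry information that an accuracy-style rate does not. Whether the paper uses these particular kernels is not stated in the abstract.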
