AI · CY · LG · Feb 18

Towards a Science of AI Agent Reliability

arXiv:2602.16666v1 · 19 citations · h-index: 22
Originality: Incremental advance
AI Analysis

The paper addresses AI agent reliability, a critical issue for safety-critical deployments, and offers a more holistic evaluation framework. Its contribution is incremental, though, in that it refines existing assessment methods.

The paper tackles the problem that AI agents often fail in practice despite high benchmark scores, by proposing twelve metrics to decompose reliability into consistency, robustness, predictability, and safety. Evaluating 14 models across two benchmarks, it finds that recent capability gains yield only small reliability improvements.

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
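
The claim that a single success metric can hide run-to-run inconsistency can be illustrated with a small sketch. Everything below is an assumption made for illustration: the outcome data is invented, and `accuracy` and `all_runs_pass_rate` are hypothetical helper names, not any of the paper's twelve metrics.

```python
from statistics import mean


def accuracy(outcomes):
    """Mean success rate over all tasks and runs (the usual single score)."""
    return mean(mean(runs) for runs in outcomes)


def all_runs_pass_rate(outcomes):
    """Fraction of tasks solved on every repeated run.

    Illustrative consistency measure only; not the paper's definition.
    """
    return mean(1.0 if all(runs) else 0.0 for runs in outcomes)


# Invented outcomes: 4 tasks x 4 repeated runs each, 1 = success, 0 = failure.
steady = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
flaky = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]

for name, outcomes in [("steady", steady), ("flaky", flaky)]:
    print(name, accuracy(outcomes), all_runs_pass_rate(outcomes))
```

Both hypothetical agents succeed on half of all runs, so the single score cannot tell them apart. The per-task view can: the first agent always solves the same two tasks, while the second never solves any task on every run.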

