CL | AI | Aug 24, 2025

Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents

arXiv:2508.17393v1 | h-index: 3 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the need for efficient, scalable testing methods among developers of conversational AI agents. It is an incremental advance because it builds on existing techniques such as LLM-as-a-Judge.

The paper tackles the evaluation of conversational AI agents by introducing the Agent-Testing Agent (ATA), a meta-agent that automates testing through adaptive test generation and LLM-based scoring. It reports that the ATA detects more diverse and severe failures than human annotators and finishes in 20-30 minutes, versus days for annotator rounds.

LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and severe failures than expert annotators while matching severity, and finishes in 20-30 minutes versus ten-annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence-grounded test generation. The ATA outputs quantitative metrics and qualitative bug reports for developers. We release the full methodology and open-source implementation for reproducible agent testing: https://github.com/KhalilMrini/Agent-Testing-Agent
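The judge-feedback loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The names `stub_judge` and `adaptive_testing`, the three capability labels, the fixed skill values, the 0.5 pass threshold, and the 1-5 difficulty scale are all assumptions made for illustration. A real system would replace `stub_judge` with an LLM-as-a-Judge call over an actual dialogue.

```python
import random
from collections import defaultdict


def stub_judge(capability, difficulty, rng):
    """Hypothetical stand-in for an LLM-as-a-Judge call; returns a score in [0, 1]."""
    # Assumed toy model: each capability has a fixed skill, harder tests score lower.
    skill = {"planning": 0.9, "retrieval": 0.6, "writing": 0.75}[capability]
    noise = rng.uniform(-0.05, 0.05)
    return max(0.0, min(1.0, skill - 0.1 * difficulty + noise))


def adaptive_testing(capabilities, judge, rounds=30, seed=0):
    """Steer tests toward the capability with the lowest mean judge score."""
    rng = random.Random(seed)
    scores = defaultdict(list)
    difficulty = {c: 1 for c in capabilities}

    # Warm-up: one test per capability so every mean is defined.
    for c in capabilities:
        scores[c].append(judge(c, difficulty[c], rng))

    for _ in range(rounds):
        means = {c: sum(s) / len(s) for c, s in scores.items()}
        target = min(means, key=means.get)  # weakest capability so far
        score = judge(target, difficulty[target], rng)
        scores[target].append(score)
        # Assumed adaptation rule: harder after a pass, easier after a fail.
        if score >= 0.5:
            difficulty[target] = min(5, difficulty[target] + 1)
        else:
            difficulty[target] = max(1, difficulty[target] - 1)

    means = {c: sum(s) / len(s) for c, s in scores.items()}
    counts = {c: len(s) for c, s in scores.items()}
    return means, counts


if __name__ == "__main__":
    means, counts = adaptive_testing(["planning", "retrieval", "writing"], stub_judge)
    print(means)
    print(counts)
```

In this toy setup the loop concentrates almost all of its test budget on the capability the judge scores lowest. A practical variant would likely reserve some budget for exploring the other capabilities.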

Code Implementations: 1 repo
