CL · May 15

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

arXiv:2605.16551 · 84.9
Predicted impact: top 44% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For developers of LLM-based agents, PQR reduces the human effort needed to identify meaningful failure cases by automatically generating realistic, failure-triggering queries.

PQR is a framework that automatically generates diverse and realistic user queries to uncover failures in LLM-based agents, specifically targeting unhelpful responses in an e-commerce QA agent. It discovers 23%–78% more unhelpful responses than prior methods, with higher diversity and realism.

Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that generates queries which not only surface agent failures with respect to specific objectives (e.g., helpfulness or safety) but also resemble real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module rewrites queries to explore diverse variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23%–78% more unhelpful responses than previous methods, and its generated queries are more diverse and realistic.
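
The two-module loop described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: every name here (`Prompt`, `refine_queries`, `refine_prompt`, `pqr_style_loop`) is hypothetical, and the generator, rewriter, agent, and judges that would be LLM calls in practice are replaced by simple stand-in functions. It only illustrates the control flow of alternating query rewrites with feedback-driven prompt updates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Feedback = Tuple[str, bool, bool]  # (query, agent_failed, looks_realistic)


@dataclass
class Prompt:
    """Hypothetical generation prompt: failure strategies plus realism policies."""
    strategies: List[str] = field(default_factory=list)
    realism_policies: List[str] = field(default_factory=list)


def refine_queries(queries: List[str], rewrite: Callable[[str], str]) -> List[str]:
    """Query refinement: add one rewritten variant per query, dropping duplicates."""
    out: List[str] = []
    for q in queries:
        for candidate in (q, rewrite(q)):
            if candidate not in out:
                out.append(candidate)
    return out


def refine_prompt(prompt: Prompt, feedback: List[Feedback]) -> Prompt:
    """Prompt refinement: turn prior feedback into new strategies and policies."""
    strategies = list(prompt.strategies)
    policies = list(prompt.realism_policies)
    for query, failed, realistic in feedback:
        if failed and realistic:
            note = f"build on: {query}"
            if note not in strategies:
                strategies.append(note)
        elif not realistic:
            note = f"avoid: {query}"
            if note not in policies:
                policies.append(note)
    return Prompt(strategies, policies)


def pqr_style_loop(prompt, generate, rewrite, agent, is_failure, is_realistic, rounds=2):
    """Alternate query refinement and prompt refinement for a fixed number of rounds."""
    found: List[str] = []
    for _ in range(rounds):
        queries = refine_queries(generate(prompt), rewrite)
        feedback: List[Feedback] = []
        for q in queries:
            failed = is_failure(q, agent(q))
            realistic = is_realistic(q)
            feedback.append((q, failed, realistic))
            if failed and realistic and q not in found:
                found.append(q)  # keep only realistic, failure-triggering queries
        prompt = refine_prompt(prompt, feedback)
    return found, prompt


# Toy stand-ins (assumptions for illustration only, not real models or judges).
found, final_prompt = pqr_style_loop(
    Prompt(),
    generate=lambda p: ["return policy", "order status"],
    rewrite=lambda q: q.upper(),
    agent=lambda q: "I don't know" if "return" in q.lower() else "Here is the answer",
    is_failure=lambda q, r: r == "I don't know",
    is_realistic=lambda q: not q.isupper(),
)
```

In this toy run, `found` holds the single realistic query that the stand-in agent fails on, while the all-caps rewrites are recorded as realism policies to avoid. A real system would feed `final_prompt` back into an LLM-based generator, which the fixed `generate` stub above does not do.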

