CL AI, May 2, 2025

PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, Dilek Hakkani-Tür

arXiv:2505.01592v1, 8.32 citations, h-index: 22

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation methods for interactive planning agents. The advance is incremental, since it builds on existing POMDP frameworks to improve diagnostic capabilities.

Existing benchmarks for interactive planning agents predominantly evaluate task completion, which may not align with user satisfaction. The paper proposes PIPA, a unified evaluation protocol based on a POMDP paradigm that assesses agent performance through a set of atomic criteria. Its analyses show that agents excel in different behavioral stages and that user satisfaction depends on both outcomes and intermediate behaviors.

The growing capabilities of large language models (LLMs) in instruction following and context understanding have led to an era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a partially observable Markov decision process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.
