ET AI · Mar 10

MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

Amazon

arXiv:2603.09643v2 · 14.0 · h-index: 17

Predicted impact: top 2% in ET · last 90 days · Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of robust evaluation for multi-modal agents in customer experience management, building incrementally on prior research.

The paper tackles the lack of evaluation frameworks for multi-modal LLM agents that adapt to user personas. It proposes the MM-tau-p$^2$ benchmark, with 12 novel metrics that measure robustness and overhead in dual-control settings, and shows that even advanced LLMs such as GPT-5 and GPT 4.1 face challenges when multi-modality is introduced.

Current evaluation frameworks and benchmarks for LLM-powered agents focus on text-chat-driven agents. These frameworks do not expose the user's persona to the agent and therefore operate in a user-agnostic environment. Importantly, in the customer experience management domain, the agent's behaviour evolves as the agent learns about the user's personality. With the proliferation of real-time TTS and multi-modal language models, LLM-based agents will gradually become multi-modal. To this end, we propose the MM-tau-p$^2$ benchmark, with metrics for evaluating the robustness of multi-modal agents in a dual-control setting, with and without adaptation to the user's persona, while also taking user inputs into account in the planning process to resolve a user query. In particular, our work shows that even with state-of-the-art frontier LLMs such as GPT-5 and GPT 4.1, introducing multi-modality into LLM-based agents raises additional considerations, which we measure with metrics, namely multi-modal robustness and turn overhead. Overall, MM-tau-p$^2$ builds on our prior work FOCAL and provides a holistic, automated way of evaluating multi-modal agents by introducing 12 novel metrics. We also provide estimates of these metrics on the telecom and retail domains using the LLM-as-judge approach, with carefully crafted prompts and well-defined rubrics for evaluating each conversation.
