LG · Oct 8, 2025

PEAR: Planner-Executor Agent Robustness Benchmark

arXiv:2510.07505v2 · 1 citation · h-index: 60
Originality: Incremental advance
AI Analysis

This work addresses the lack of a holistic understanding of vulnerabilities in multi-agent systems by providing a benchmark for researchers and practitioners. It is an incremental advance because it builds on existing planner-executor architectures.

The paper tackles adversarial vulnerabilities in LLM-based multi-agent systems by introducing PEAR, a benchmark for evaluating both utility and robustness. It finds that attacks targeting the planner are particularly effective and that there is a trade-off between task performance and robustness.

Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.
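As an illustration only, the planner-executor structure and the two quantities the abstract says the benchmark evaluates (utility on clean tasks and vulnerability under attack) can be sketched with toy, non-LLM agents. Everything below is a hypothetical assumption made for this sketch and is not PEAR's code or API: the names Planner, Executor, run_system, utility and attack_succeeded, the comma-splitting planning rule, and the injected step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Planner:
    """Toy planner (hypothetical): splits a task into steps and keeps a memory."""
    memory: List[str] = field(default_factory=list)

    def plan(self, task: str) -> List[str]:
        # Remember the task, then emit one step per comma-separated sub-goal.
        self.memory.append(task)
        return [part.strip() for part in task.split(",") if part.strip()]


class Executor:
    """Toy executor (hypothetical): carries out whatever step it is given."""

    def execute(self, step: str) -> str:
        return f"done({step})"


def run_system(task: str) -> List[str]:
    # The planner decides the steps; the executor runs each one in order.
    planner, executor = Planner(), Executor()
    return [executor.execute(step) for step in planner.plan(task)]


def utility(results: List[str], expected_steps: List[str]) -> bool:
    # Clean-task success: every expected step was carried out.
    return all(f"done({s})" in results for s in expected_steps)


def attack_succeeded(results: List[str], attacker_step: str) -> bool:
    # Attack success: the attacker's step was carried out.
    return f"done({attacker_step})" in results


clean_task = "book flight, reserve hotel"
expected = ["book flight", "reserve hotel"]
attacker_step = "email passwords"  # assumed injected instruction

clean_results = run_system(clean_task)
# Planner-targeted injection: the attacker's text reaches the planner's input.
attacked_results = run_system(f"{clean_task}, {attacker_step}")

print(utility(clean_results, expected))
print(attack_succeeded(clean_results, attacker_step))
print(attack_succeeded(attacked_results, attacker_step))
```

In this toy setup the executor follows any step the planner emits, so text injected into the planner's input propagates through the whole system. That is only a simplified picture of why a planner-targeted attack can mislead the system; the benchmark's actual tasks, attacks and metrics are defined in the paper.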

