AI · May 24, 2025

$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

arXiv:2505.18746v4 · 2 citations · h-index: 2 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better evaluation of LLM-based agents in complex, real-world scenarios, though it appears incremental as it builds on existing benchmarking approaches.

The authors tackled the problem of evaluating LLM-based agents in multi-tasking environments by introducing C³-Bench, a benchmark that tests agent robustness through three challenges involving tool relationships, hidden information, and dynamic decision paths. They evaluated 49 mainstream agents and found significant shortcomings in handling tool dependencies, long context dependencies, and policy switching.

Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues. However, it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present an open-source and high-quality benchmark $C^3$-Bench. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information and managing dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms and reproducible evaluation methods. Extensive experiments are conducted on 49 mainstream agents, encompassing general fast-thinking, slow-thinking and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long-context information dependencies and frequent policy-type switching. In essence, $C^3$-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance. The benchmark is publicly available at https://github.com/TencentHunyuan/C3-Benchmark.

