AI · May 27

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

arXiv:2605.27820 · 88.9 · h-index: 3
Predicted impact: top 22% in AI · last 90 days
Originality: Highly original
AI Analysis

For researchers developing AI agents that require joint multimodal perception, tool invocation, and dynamic interaction, EgoBench provides a challenging benchmark that exposes current capability bottlenecks.

EgoBench is the first interactive multimodal benchmark for tool-using agents, comprising 1,045 egocentric-video-grounded tasks across four daily scenarios. The best model reaches only 30.62% accuracy in its best-performing scenario and averages 19.43% across all four scenarios, revealing a severe performance ceiling.

As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation with multi-hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi-capability tasks, simulating natural and task-constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool-using agents. EgoBench comprises 1,045 egocentric-video-grounded tasks covering four daily scenarios, along with a user-agent-tool interactive environment for evaluation. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We additionally develop a multi-agent simulated user within EgoBench that generates high-fidelity, task-aligned responses, allowing agents' interaction capabilities to be evaluated. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process-based and result-based equivalence. Benchmarking eight SOTA video-MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best-performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi-dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.

