CV · Jul 24, 2025

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

arXiv:2507.18342v1 · 19 citations · h-index: 21
Originality: Synthesis-oriented
AI Analysis

This work fills a gap in evaluating MLLMs for embodied agents and intelligent assistants by providing a benchmark for cross-view video understanding. Its contribution is incremental, however: it builds on existing datasets and focuses on benchmarking rather than proposing new methods.

The paper tackles cross-view reasoning between first-person and third-person video perspectives in multimodal large language models (MLLMs). It introduces EgoExoBench, a benchmark with over 7,300 question-answer pairs, and finds that while MLLMs excel on single-view tasks, they struggle with semantic alignment, viewpoint association, and temporal reasoning across views.

Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.

