AI · CV · Aug 18, 2025

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding

arXiv:2508.12687v2 · 9 citations · h-index: 56 · Has Code · EMNLP
Originality: Synthesis-oriented
AI Analysis

The benchmark addresses unreliable MLLM outputs, a concern for researchers and developers in video understanding. The contribution is incremental, since it measures hallucinations rather than solving them.

The paper tackles hallucinations in Multimodal Large Language Models (MLLMs) on egocentric video by introducing EgoIllusion, a benchmark of 1,400 videos and 8,000 questions. On this benchmark, even strong models such as GPT-4o and Gemini reach only 59% accuracy.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in complex multimodal tasks. While MLLMs excel at visual perception and reasoning in third-person and egocentric videos, they are prone to hallucinations, generating coherent yet inaccurate responses. We present EgoIllusion, the first benchmark to evaluate MLLM hallucinations in egocentric videos. EgoIllusion comprises 1,400 videos paired with 8,000 human-annotated open- and closed-ended questions designed to trigger hallucinations in both visual and auditory cues in egocentric videos. Evaluations across ten MLLMs reveal significant challenges, with even powerful models like GPT-4o and Gemini achieving only 59% accuracy. EgoIllusion lays the foundation for developing robust benchmarks to evaluate the effectiveness of MLLMs and spurs the development of better egocentric MLLMs with reduced hallucination rates. Our benchmark will be open-sourced for reproducibility.

