CV | Dec 4, 2024

VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

arXiv:2412.03735v2 | 64 citations | h-index: 14 | CVPR
Originality: Incremental advance
AI Analysis

This paper addresses inaccurate content generation in video MLLMs, a problem relevant to both researchers and practitioners. The advance is incremental because it builds on existing observations and methods.

The paper addresses hallucinations in multimodal large language models for video understanding. It introduces VidHalluc, a benchmark of 5,002 videos that evaluates hallucinations across three dimensions (action, temporal sequence, and scene transition), and proposes DINO-HEAL, a training-free method that improves performance on the benchmark by 3.02% on average across all tasks.

Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
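The abstract describes DINO-HEAL only at a high level: spatial saliency from DINOv2 is used to reweight visual features at inference time, with no training. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes the saliency is one non-negative score per image patch, that scores are min-max normalized, and that each patch feature is scaled by (1 + weight); the function names `normalize_saliency` and `reweight_features` are hypothetical, and the paper's exact formulation may differ.

```python
from typing import List, Sequence


def normalize_saliency(saliency: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Min-max normalize per-patch saliency scores to roughly [0, 1].

    Assumption: one scalar score per patch (e.g. derived from an attention map).
    """
    if not saliency:
        raise ValueError("saliency must contain at least one score")
    lo, hi = min(saliency), max(saliency)
    span = hi - lo
    # eps avoids division by zero when all scores are equal.
    return [(s - lo) / (span + eps) for s in saliency]


def reweight_features(
    features: Sequence[Sequence[float]], saliency: Sequence[float]
) -> List[List[float]]:
    """Scale each patch feature vector by (1 + normalized saliency).

    Assumption: this residual-style scaling is illustrative only; it keeps
    low-saliency patches unchanged and amplifies high-saliency ones.
    """
    if len(features) != len(saliency):
        raise ValueError("need exactly one saliency score per patch feature")
    weights = normalize_saliency(saliency)
    return [[(1.0 + w) * x for x in row] for w, row in zip(weights, features)]


if __name__ == "__main__":
    # Three patches with 2-dimensional features and increasing saliency.
    patch_features = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    patch_saliency = [0.0, 1.0, 2.0]
    for row in reweight_features(patch_features, patch_saliency):
        print(row)
```

Because the reweighting is a pure function of features that already exist, it can be applied between the visual encoder and the language model without updating any weights, which is what "training-free" means in this context.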

