CV · CL · May 26

Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

arXiv:2605.27101 · 68.5
Predicted impact: top 45% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For researchers and developers of VideoLLMs, this work identifies a critical failure mode in temporal grounding that undermines robust video understanding.

VideoLLMs hallucinate interactions between entities that appear in different video segments. The paper calls this systematic failure bag-of-events (BoE) behavior: the models treat a video as a collection of events rather than as a temporally structured sequence. All 11 evaluated models showed substantial BoE behavior, indicating a lack of reliable temporal grounding.

A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.
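
The intervention described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the benchmark's question format, scoring rule, or model interface, so everything here is an assumption made for illustration: the names `Probe`, `insert_distractor`, and `boe_rate` are hypothetical, the probes are assumed to be yes/no questions whose correct answer is always "no", and `answer_fn` stands in for whatever VideoLLM is being tested.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Probe:
    """Hypothetical yes/no question that pairs a subject from the main video
    with an action that occurs only in the inserted clip.
    Assumption: the correct answer to every probe is "no"."""
    question: str


def insert_distractor(main_frames: Sequence, distractor_frames: Sequence,
                      position: int) -> list:
    """Splice an unrelated clip into the main video at frame index `position`."""
    if not 0 <= position <= len(main_frames):
        raise ValueError("position must lie within the main video")
    return (list(main_frames[:position])
            + list(distractor_frames)
            + list(main_frames[position:]))


def boe_rate(answer_fn: Callable[[list, str], str], frames: list,
             probes: List[Probe]) -> float:
    """Fraction of probes the model wrongly affirms.

    Assumed metric, not necessarily the one used in the paper: an answer
    starting with "yes" counts as a hallucinated cross-segment interaction.
    """
    if not probes:
        raise ValueError("at least one probe is required")
    wrong = sum(
        1 for p in probes
        if answer_fn(frames, p.question).strip().lower().startswith("yes")
    )
    return wrong / len(probes)


# Usage with placeholder frames and a stub model that ignores temporal
# structure and affirms anything mentioned anywhere in the video.
video = insert_distractor(["cook_0", "cook_1", "cook_2"], ["ad_0", "ad_1"], 1)
probes = [Probe("Does the chef drink the soda?")]
rate = boe_rate(lambda frames, question: "Yes.", video, probes)
```

A model with reliable subject-event association would score near zero on such probes, because the probed action never involves the main-video subject. A model that pools events across segments would score high.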

