CV, Feb 11

Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

arXiv:2602.11244v1
Originality: Incremental advance
AI Analysis

The paper exposes fundamental weaknesses in VidLMs on tasks that require temporal and visual grounding, a capability crucial for applications such as video understanding and robotics. The contribution is rated incremental because it provides a diagnostic benchmark rather than a new solution.

The paper investigates whether Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion, and finds that they often do not. In the stress tests, models confidently describe reversed scenes as forward, answer questions while neglecting the video, agree with false claims, struggle with camera motion, and fail under spatiotemporal occlusion. Humans succeed at the same tasks easily.

This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests, assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
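
To make two of the stress tests concrete, the following is a minimal sketch of how a reversed clip and a spatiotemporally occluded clip could be generated from a sequence of frames. It is not the authors' pipeline, which is not described here. The function names `reverse_clip` and `sliding_strip_mask`, the sliding-strip masking scheme, and the list-of-lists frame representation are all assumptions chosen for illustration.

```python
from typing import List

Frame = List[List[int]]  # one grayscale frame as rows of pixel values (assumed format)


def reverse_clip(frames: List[Frame]) -> List[Frame]:
    """Return the clip with its frame order reversed; frames are unchanged."""
    return frames[::-1]


def sliding_strip_mask(frames: List[Frame], strip_width: int, fill: int = 0) -> List[Frame]:
    """Keep one vertical strip per frame and blank the rest.

    The visible strip moves right over time, so the whole scene can only be
    recovered by aggregating information across frames. This is a hypothetical
    masking scheme, not necessarily the one used in the benchmark.
    """
    if strip_width <= 0:
        raise ValueError("strip_width must be positive")
    masked = []
    for t, frame in enumerate(frames):
        width = len(frame[0])
        start = (t * strip_width) % width
        visible = range(start, min(start + strip_width, width))
        masked.append([
            [px if x in visible else fill for x, px in enumerate(row)]
            for row in frame
        ])
    return masked


# Tiny synthetic clip: 4 frames of 2x4 pixels, frame t filled with the value t + 1.
clip = [[[t + 1] * 4 for _ in range(2)] for t in range(4)]
reversed_clip = reverse_clip(clip)
occluded_clip = sliding_strip_mask(clip, strip_width=1)
```

In a setup like this, a model would be asked to describe `reversed_clip` or answer questions about `occluded_clip`, and its answers would be compared against those given for the original `clip`.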

