CV, Feb 11

Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

Sethuraman T, Savya Khosla, Aditi Tiwari, Vidya Ganesh, Rakshana Jayaprakash, Aditya Jain, Vignesh Srinivasakumar, Onkar Kishor Susladkar, Srinidhi Sunkara, Aditya Shanmugham, Rakesh Vaideeswaran, Abbaas Alif Mohamed Nishar

arXiv:2602.11244v12.8h-index: 48

Originality: Incremental advance

AI Analysis

The paper reveals fundamental weaknesses in VidLMs on tasks requiring temporal and visual grounding, a capability that is crucial for applications such as video understanding and robotics. The advance is incremental because it contributes a diagnostic benchmark rather than a new solution.

The paper investigates whether Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion, and finds that they often fail. In its stress tests, models confidently describe reversed scenes as forward, answer questions while neglecting the video, agree with false claims, struggle with camera motion, and fail under spatiotemporal occlusion, whereas humans succeed easily.

This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
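The abstract does not describe how the diagnostic examples are built, so the following is only an illustrative sketch of two of the perturbations it names: reversed playback and spatiotemporal masking. The function names (`reverse_clip`, `mask_clip`), the moving-stripe mask, and the toy frame representation are all assumptions made for illustration, not the authors' pipeline.

```python
from typing import List, Sequence

# Hypothetical toy representation: a grayscale frame is a list of pixel rows.
Frame = List[List[int]]


def reverse_clip(frames: Sequence[Frame]) -> List[Frame]:
    """Return the frames in reverse temporal order (a reversed-playback probe)."""
    return list(reversed(frames))


def mask_clip(frames: Sequence[Frame], stripe_width: int = 1) -> List[Frame]:
    """Hide every column except a vertical stripe that shifts with each frame.

    No single frame shows the whole scene, so answering a question about it
    requires combining information across time. Hidden pixels are set to 0.
    This stripe pattern is an assumed stand-in for spatiotemporal occlusion.
    """
    masked = []
    for t, frame in enumerate(frames):
        width = len(frame[0])
        start = (t * stripe_width) % width
        visible = {(start + k) % width for k in range(stripe_width)}
        masked.append(
            [[px if x in visible else 0 for x, px in enumerate(row)] for row in frame]
        )
    return masked


# Toy clip: 3 frames of 2x3 pixels, each frame filled with its index + 1.
clip = [[[t + 1] * 3 for _ in range(2)] for t in range(3)]
reversed_clip = reverse_clip(clip)
masked_clip = mask_clip(clip)
```

In a probe of this kind, a model that still describes `reversed_clip` as playing forward, or that cannot answer from `masked_clip` what a person could by watching the stripe sweep across the scene, would be showing the failures the abstract reports.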
