CV · AI · Mar 11

Are Video Reasoning Models Ready to Go Outside?

arXiv:2603.10652v1 · 96.5 · h-index: 3 · Has Code
Predicted impact: top 3% in CV · last 90 days · Originality: Highly original
AI Analysis

This paper addresses the robustness of video reasoning models in real-world deployment. It is an incremental advance that offers a novel method for a known bottleneck.

The paper tackles the degradation of vision-language models under real-world disturbances such as weather and occlusion. It proposes ROVA, a training framework that improves robustness through a robustness-aware consistency reward, yielding a relative accuracy gain of at least 24% and a reasoning improvement of over 9% compared with baselines.

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (Qwen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
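The abstract does not spell out how the consistency reward or the difficulty-aware sampling is computed, so the following is only a minimal sketch under stated assumptions, not ROVA's actual formulation. It assumes a reward that scores the answer on a perturbed clip for correctness and adds a bonus when it agrees with the answer on the clean clip, a running difficulty estimate per sample, and a sampling weight that favors samples of intermediate difficulty. All function names, the bonus value, the momentum, and the weighting rule are hypothetical.

```python
import random

def consistency_reward(clean_answer, perturbed_answer, gold, agree_bonus=0.5):
    """Hypothetical reward: 1.0 for a correct answer on the perturbed clip,
    plus agree_bonus when it matches the answer given on the clean clip."""
    correct = 1.0 if perturbed_answer == gold else 0.0
    agree = agree_bonus if perturbed_answer == clean_answer else 0.0
    return correct + agree

def update_difficulty(old, reward, max_reward=1.5, momentum=0.8):
    """Re-estimate difficulty as a moving average of the failure rate
    (assumed rule; 0.0 means always solved, 1.0 means never solved)."""
    failure = 1.0 - reward / max_reward
    return momentum * old + (1.0 - momentum) * failure

def sampling_weight(difficulty, floor=0.05):
    """Assumed prioritization: peak at intermediate difficulty, with a
    floor so that no sample is dropped from training entirely."""
    return difficulty * (1.0 - difficulty) + floor

def pick_batch(difficulties, k, rng=random):
    """Draw k sample indices (with replacement) in proportion to their weight."""
    indices = list(range(len(difficulties)))
    weights = [sampling_weight(d) for d in difficulties]
    return rng.choices(indices, weights=weights, k=k)
```

In a training loop, one would call `pick_batch` on the current difficulty estimates, score each sampled item with `consistency_reward`, and feed the result back through `update_difficulty`, so that the sampling distribution tracks the model's changing capability.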

