RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos
The work addresses a robustness gap in video understanding with MLLMs, which matters for real-world applications, but the contribution is incremental because it builds on existing benchmarks and methods.
The paper evaluates the robustness of Multi-modal Large Language Models (MLLMs) against manipulated video content by introducing Ro-Bench, a benchmark built from counterfactual video test sets. It finds that current models degrade substantially on Ro-Bench, and that fine-tuning with counterfactual data improves performance by 21.73% on Ro-Bench and by 12.78% across 20 tasks in MVBench.
Recently, Multi-modal Large Language Models (MLLMs) have demonstrated strong performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse, and temporally relevant video data produced by editing Style, Object, Background, and their compositions. We evaluate eight recent video MLLMs and find that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.
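As a rough illustration of how degradation on counterfactual test sets might be quantified, the sketch below compares accuracy on original videos with accuracy on each edited variant. The record layout (`edit`, `pred`, `answer`), the function names, and the toy records are assumptions made for this example; they are not Ro-Bench's actual data format, metric definition, or results. The sketch reports an absolute accuracy drop, chosen here only for simplicity.

```python
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose prediction matches the reference answer."""
    if not records:
        raise ValueError("no records to score")
    return sum(r["pred"] == r["answer"] for r in records) / len(records)


def degradation_by_edit(records):
    """Map each edit type to (original accuracy - accuracy on that edit type).

    Assumes every record carries an 'edit' field, with the unedited videos
    labelled 'original' (a hypothetical convention for this sketch).
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["edit"]].append(r)
    base = accuracy(groups["original"])
    return {
        edit: base - accuracy(rs)
        for edit, rs in groups.items()
        if edit != "original"
    }


# Toy multiple-choice predictions, invented purely for illustration.
toy_records = [
    {"edit": "original", "pred": "A", "answer": "A"},
    {"edit": "original", "pred": "B", "answer": "B"},
    {"edit": "original", "pred": "C", "answer": "C"},
    {"edit": "original", "pred": "A", "answer": "D"},
    {"edit": "style", "pred": "A", "answer": "A"},
    {"edit": "style", "pred": "B", "answer": "C"},
    {"edit": "object", "pred": "A", "answer": "B"},
    {"edit": "object", "pred": "C", "answer": "D"},
]

drops = degradation_by_edit(toy_records)
```

A positive value in `drops` means the model scored lower on that edit type than on the unedited videos; the same grouping could be extended to background edits or to compositions of edits.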