CV | MM | Jul 5, 2022

Robustness Analysis of Video-Language Models Against Visual and Language Perturbations

arXiv:2207.02159v4 | 36 citations | h-index: 34
Originality: Synthesis-oriented
AI Analysis

This work addresses the robustness problem for researchers and practitioners in multi-modal AI by providing benchmark datasets and initial findings, though it is incremental as it focuses on analysis rather than new methods.

The authors tackled the lack of robustness studies in video-language models by conducting the first extensive analysis against real-world perturbations, revealing that models are more susceptible to video perturbations than text ones, with pre-trained models showing greater robustness.

Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of video-language models against various real-world perturbations. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations. The study reveals some interesting initial findings from the studied models: 1) models are generally more susceptible when only video is perturbed than when only text is perturbed, 2) pre-trained models are more robust than those trained from scratch, and 3) models attend more to scene and objects than to motion and action. We hope this study will serve as a benchmark and guide future research in robust video-language learning. The benchmark introduced in this study, along with the code and datasets, is available at https://bit.ly/3CNOly4.
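To make the evaluation setup concrete, the following is a minimal sketch of how a drop in text-to-video retrieval quality under perturbation might be measured. It is not the authors' code: the function names (`recall_at_k`, `relative_drop`, `add_noise`), the toy similarity matrix, and the use of Gaussian score jitter as a stand-in for a real visual or text perturbation are all assumptions made for illustration. The abstract does not specify the exact robustness metric, so the relative drop in Recall@K used here is only one plausible choice.

```python
import random

def recall_at_k(sim, k):
    """Recall@K for text-to-video retrieval.

    sim[i][j] is the similarity of text query i to video j;
    the correct video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scored strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)

def relative_drop(clean, perturbed):
    """Fraction of clean performance lost under perturbation (0 means no loss)."""
    return (clean - perturbed) / clean if clean > 0 else 0.0

def add_noise(sim, scale, rng):
    """Hypothetical stand-in for a perturbation: jitter every similarity score."""
    return [[s + rng.gauss(0.0, scale) for s in row] for row in sim]

rng = random.Random(0)
n = 50
# Toy "clean" scores: the matching pair scores 1.0, all others fall below 0.8.
clean_sim = [[1.0 if i == j else rng.uniform(0.0, 0.8) for j in range(n)]
             for i in range(n)]
perturbed_sim = add_noise(clean_sim, scale=0.5, rng=rng)

r_clean = recall_at_k(clean_sim, k=5)
r_pert = recall_at_k(perturbed_sim, k=5)
print(f"R@5 clean={r_clean:.2f} perturbed={r_pert:.2f} "
      f"relative drop={relative_drop(r_clean, r_pert):.2f}")
```

In a real evaluation, the perturbed similarity matrix would come from re-encoding perturbed videos or captions with the model under test, and the comparison would be repeated per perturbation type and severity.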

Code Implementations: 1 repo