A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
This work addresses the robustness of large-scale vision-and-language (V+L) pre-trained models, a critical concern for deploying them reliably in real-world applications that depend on multimodal understanding.
This paper investigates the robustness of vision-and-language (V+L) pre-trained models across four types of robustness: Linguistic Variation, Logical Reasoning, Visual Content Manipulation, and Answer Distribution Shift. The authors find that standard finetuning already makes pre-trained V+L models more robust than many task-specific state-of-the-art methods. They then propose Mango, a Multimodal Adversarial Noise GeneratOr, which sets a new state of the art on 7 out of 9 robustness benchmarks.
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although these models achieve impressive performance on standard tasks, it remains unclear how robust they are. To investigate, we conduct a host of thorough evaluations on existing pre-trained models over 4 different types of V+L-specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard model finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Unlike previous studies, which focus on one specific type of robustness, Mango is task-agnostic and enables a universal performance lift for pre-trained models over diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves a new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study on V+L robustness, this work puts the robustness of pre-trained models into sharper focus and points to new directions for future study.
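To make the idea of a learned noise generator in the embedding space more concrete, the following toy sketch trains a small generator to raise the loss of a frozen linear classifier. It is not the Mango implementation: the generator form (an affine map of Gaussian noise), the L2 clipping radius `eps`, the logistic loss, the toy weights, and every name in the code are assumptions made purely for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def logistic_loss(w, emb, label):
    """Loss of a frozen linear 'model' on one embedding; label is +1 or -1."""
    margin = label * dot(w, emb)
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def clip_to_ball(delta, eps):
    """Rescale delta so that its L2 norm is at most eps."""
    norm = math.sqrt(dot(delta, delta))
    return delta if norm <= eps else [d * eps / norm for d in delta]


class NoiseGenerator:
    """Hypothetical generator: delta = clip(b + A z), with z ~ N(0, I)."""

    def __init__(self, dim, eps, rng):
        self.eps = eps
        self.rng = rng
        self.A = [[0.01 * rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
        self.b = [0.0] * dim

    def sample(self):
        z = [self.rng.gauss(0, 1) for _ in self.b]
        raw = [bi + dot(row, z) for bi, row in zip(self.b, self.A)]
        return z, clip_to_ball(raw, self.eps)

    def ascent_step(self, w, emb, label, lr):
        """One gradient-ascent step on the model's loss (the generator tries to fool it)."""
        z, delta = self.sample()
        perturbed = [e + d for e, d in zip(emb, delta)]
        margin = label * dot(w, perturbed)
        sig = 1.0 / (1.0 + math.exp(margin))  # sigmoid(-margin)
        # dLoss/d(delta) = -label * sigmoid(-margin) * w. The clipping step is
        # ignored in the gradient (a straight-through simplification).
        grad = [-label * sig * wi for wi in w]
        for i, gi in enumerate(grad):
            self.b[i] += lr * gi
            for j, zj in enumerate(z):
                self.A[i][j] += lr * gi * zj


def run_toy_example(steps=200, seed=0):
    rng = random.Random(seed)
    w = [1.0, -2.0, 0.5]    # frozen toy model weights (assumed)
    emb = [0.5, -0.5, 1.0]  # one toy embedding (assumed)
    label = 1
    gen = NoiseGenerator(dim=3, eps=0.5, rng=rng)
    for _ in range(steps):
        gen.ascent_step(w, emb, label, lr=0.5)
    clean = logistic_loss(w, emb, label)
    adv = []
    for _ in range(100):
        _, delta = gen.sample()
        adv.append(logistic_loss(w, [e + d for e, d in zip(emb, delta)], label))
    return clean, sum(adv) / len(adv)


if __name__ == "__main__":
    clean_loss, adversarial_loss = run_toy_example()
    print("clean loss:", clean_loss)
    print("mean loss under generated noise:", adversarial_loss)
```

In this sketch only the generator is updated. In an adversarial training setup, the model would also be trained on the perturbed embeddings so that it becomes harder to fool; how Mango schedules and combines these steps is not specified in the abstract above.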