LG, Apr 10, 2023

On Robustness in Multimodal Learning

Brandon McKinzie, Joseph Cheng, Vaishaal Shankar, Yinfei Yang, Jonathon Shlens, Alexander Toshev

arXiv:2304.04385v25.34 citations, h-index: 59

Originality: Incremental advance

AI Analysis

The paper addresses robustness in multimodal learning for applications on hardware platforms. It is an incremental advance: it builds on existing methods and adds specific improvements.

The paper tackles the problem of multimodal models behaving differently when the available modalities differ between training and deployment. It proposes a robustness framework and two intervention techniques that yield 1.5x-4x robustness improvements on AudioSet, Kinetics-400, and ImageNet-Captions, and it reports a competitive 44.2 mAP on AudioSet 20K.

Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the types of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, to achieve competitive results of $44.2$ mAP on AudioSet 20K.
