CV
Jan 23, 2023

Zorro: the masked multimodal transformer

DeepMind
arXiv:2301.09595v2
25 citations
h-index: 188
AI Analysis

Zorro addresses two needs: contrastive audio-visual self-supervised learning requires independent audio and visual features, and audio-visual models should support unimodal inference. The contribution is incremental, since it builds on existing transformer architectures.

The paper tackles the problem of fully entangled multimodal representations in attention-based models by introducing Zorro, a masking technique that keeps parts of the representation modality-pure. With contrastive pre-training, it achieves state-of-the-art results on most relevant multimodal benchmarks (AudioSet and VGGSound).

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
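
The routing idea in the abstract can be illustrated with a small attention mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the token layout `[audio | video | fusion]`, the rule that fusion tokens attend to everything, and the helper names `modality_mask` and `masked_attention` are all hypothetical choices made for illustration. The abstract only says that masks control how each modality is routed so that some parts of the representation stay modality-pure.

```python
import numpy as np


def modality_mask(n_audio, n_video, n_fusion):
    """Boolean mask: mask[i, j] is True when query token i may attend to key token j.

    Assumed (hypothetical) token order: [audio | video | fusion].
    """
    n = n_audio + n_video + n_fusion
    mask = np.zeros((n, n), dtype=bool)
    audio = slice(0, n_audio)
    video = slice(n_audio, n_audio + n_video)
    fusion = slice(n_audio + n_video, n)
    mask[audio, audio] = True   # audio tokens see only audio: stays modality-pure
    mask[video, video] = True   # video tokens see only video: stays modality-pure
    mask[fusion, :] = True      # assumed: fusion tokens may see every token
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with disallowed pairs masked out."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; every row has at least
    # one allowed key, so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(6, 4))          # 2 audio, 3 video, 1 fusion token
    mask = modality_mask(2, 3, 1)
    out = masked_attention(tokens, tokens, tokens, mask)
    print(mask.astype(int))
    print(out.shape)
```

Under this mask, perturbing the video tokens leaves the audio outputs unchanged, which is the property that makes unimodal inference and contrastive audio-visual training possible in principle.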

Code Implementations: 1 repo
